
This Data Warehousing site aims to help people get a good high-level understanding of what it takes to implement a successful data warehouse project. A lot of the information is from my personal experience as a business intelligence professional, both as a client and as a vendor. This site is divided into five main areas.

- Tools: The selection of business intelligence tools and the selection of the data warehousing team. Tools covered are: Database, Hardware, ETL (Extraction, Transformation, and Loading), OLAP, Reporting, and Metadata.

- Steps: This section contains the typical milestones for a data warehousing project, from requirement gathering and query optimization to production rollout and beyond. I also offer my observations on the data warehousing field.
- Business Intelligence: Business intelligence is closely related to data warehousing. This section discusses business intelligence, as well as the relationship between business intelligence and data warehousing.
- Concepts: This section discusses several concepts particular to the data warehousing field. Topics include: Dimensional Data Model, Star Schema, Snowflake Schema, Slowly Changing Dimension, Conceptual Data Model, Logical Data Model, Physical Data Model, Conceptual vs. Logical vs. Physical Data Model, Data Integrity, What is OLAP, MOLAP vs. ROLAP vs. HOLAP, and Bill Inmon vs. Ralph Kimball.

- Business Intelligence Conferences: Lists upcoming conferences in the business intelligence / data warehousing industry.
- Glossary: A glossary of common data warehousing terms.

This site is updated frequently to reflect the latest technology, information, and reader feedback. Please bookmark this site now.

Data Warehouse Definition

Different people have different definitions for a data warehouse. The most popular definition came from Bill Inmon, who provided the following: A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision making process.

Subject-Oriented: A data warehouse can be used to analyze a particular subject area. For example, "sales" can be a particular subject.

Integrated: A data warehouse integrates data from multiple data sources. For example, source A and source B may have different ways of identifying a product, but in a data warehouse, there will be only a single way of identifying a product.

Time-Variant: Historical data is kept in a data warehouse. For example, one can retrieve data from 3 months, 6 months, 12 months, or even older from a data warehouse. This contrasts with a transaction system, where often only the most recent data is kept. For example, a transaction system may hold the most recent address of a customer, while a data warehouse can hold all addresses associated with a customer.

Non-Volatile: Once data is in the data warehouse, it will not change. So, historical data in a data warehouse should never be altered.

Ralph Kimball provided a more concise definition: A data warehouse is a copy of transaction data specifically structured for query and analysis. This is a functional view of a data warehouse: Kimball did not address how the data warehouse is built, as Inmon did; rather, he focused on the functionality of a data warehouse.

Data Warehouse Architecture

Different data warehousing systems have different structures. Some may have an ODS (operational data store), while some may have multiple data marts. Some may have a small number of data sources, while some may have dozens. In view of this, it is far more reasonable to present the different layers of a data warehouse architecture than to discuss the specifics of any one system. In general, all data warehouse systems have the following layers:

- Data Source Layer
- Data Extraction Layer
- Staging Area
- ETL Layer
- Data Storage Layer
- Data Logic Layer
- Data Presentation Layer
- Metadata Layer
- System Operations Layer

The picture below shows the relationships among the different components of the data warehouse architecture:

Each component is discussed individually below:

Data Source Layer
This represents the different data sources that feed data into the data warehouse. The data source can be of any format -- a plain text file, a relational database, another type of database, an Excel file, and so on can all act as a data source.

Many different types of data can be a data source:

- Operations -- such as sales data, HR data, product data, inventory data, marketing data, systems data.
- Web server logs with user browsing data.
- Internal market research data.
- Third-party data, such as census data, demographics data, or survey data.

All these data sources together form the Data Source Layer.

Data Extraction Layer
Data gets pulled from the data source into the data warehouse system. There is likely some minimal data cleansing, but there is unlikely to be any major data transformation.

Staging Area
This is where data sits prior to being scrubbed and transformed into a data warehouse / data mart. Having one common area makes subsequent data processing / integration easier.

ETL Layer
This is where data gains its "intelligence", as logic is applied to transform the data from a transactional nature to an analytical nature. This layer is also where data cleansing happens.

Data Storage Layer
This is where the transformed and cleansed data sit. Based on scope and functionality, three types of entities can be found here: data warehouse, data mart, and operational data store (ODS). In any given system, you may have just one of the three, two of the three, or all three types.

Data Logic Layer
This is where business rules are stored. Business rules stored here do not affect the underlying data transformation rules, but they do affect what the report looks like.

Data Presentation Layer
This refers to the information that reaches the users. This can be in the form of a tabular / graphical report in a browser, an emailed report that gets automatically generated and sent every day, or an alert that warns users of exceptions, among others.

Metadata Layer

This is where information about the data stored in the data warehouse system is kept. A logical data model would be an example of something that's in the metadata layer.

System Operations Layer
This layer includes information on how the data warehouse system operates, such as ETL job status, system performance, and user access history.
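To make the flow through these layers concrete, here is a minimal sketch using Python's built-in sqlite3 module. All table names, column names, and data values below are invented for illustration, not taken from any real system.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Data Source Layer: a raw operational table (simulated here), with messy
# store codes and amounts stored as text.
cur.execute("CREATE TABLE src_sales (txn_date TEXT, store TEXT, amount TEXT)")
cur.executemany("INSERT INTO src_sales VALUES (?, ?, ?)",
                [("2024-01-01", "NY", "100.50"), ("2024-01-01", "ny ", "20.00")])

# Data Extraction Layer / Staging Area: land the data as-is,
# with no major transformation.
cur.execute("CREATE TABLE stg_sales AS SELECT * FROM src_sales")

# ETL Layer: apply transformation logic -- normalize store codes, cast amounts.
cur.execute("""
    CREATE TABLE dw_sales AS
    SELECT txn_date,
           UPPER(TRIM(store)) AS store,
           CAST(amount AS REAL) AS sales_amount
    FROM stg_sales
""")

# Data Storage Layer: the cleansed, analysis-ready table.
rows = cur.execute(
    "SELECT store, SUM(sales_amount) FROM dw_sales GROUP BY store").fetchall()
print(rows)  # both source rows now roll up under the single store code 'NY'
```

Note how the two differently-coded source rows integrate into a single product of the transformation, which is exactly the "Integrated" property from Inmon's definition.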

Dimensional Data Model

The dimensional data model is most often used in data warehousing systems. This is different from the 3rd normal form, commonly used for transactional (OLTP) type systems. As you can imagine, the same data would then be stored differently in a dimensional model than in a 3rd normal form model. To understand dimensional data modeling, let's define some of the terms commonly used in this type of modeling:

- Dimension: A category of information. For example, the time dimension.
- Attribute: A unique level within a dimension. For example, Month is an attribute in the Time Dimension.
- Hierarchy: The specification of levels that represents the relationship between different attributes within a dimension. For example, one possible hierarchy in the Time dimension is Year > Quarter > Month > Day.
- Fact Table: A fact table is a table that contains the measures of interest. For example, sales amount would be such a measure. This measure is stored in the fact table with the appropriate granularity. For example, it can be sales amount by store by day. In this case, the fact table would contain three columns: a date column, a store column, and a sales amount column.
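The sales-amount-by-store-by-day fact table just described can be sketched as follows (Python with sqlite3; table name, column names, and values are illustrative only):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# One measure column plus one column per dimension, at day/store granularity.
cur.execute("""
    CREATE TABLE fct_sales (
        sale_date    TEXT,   -- Date dimension, at day granularity
        store        TEXT,   -- Store dimension
        sales_amount REAL    -- the measure (fact)
    )
""")
cur.executemany("INSERT INTO fct_sales VALUES (?, ?, ?)", [
    ("2024-01-01", "NY", 100.0),
    ("2024-01-01", "SF", 80.0),
    ("2024-01-02", "NY", 120.0),
])

# Rolling the measure up a hierarchy level (Day -> Month) is a GROUP BY
# at the coarser attribute.
rows = cur.execute("""
    SELECT substr(sale_date, 1, 7) AS month, store, SUM(sales_amount)
    FROM fct_sales
    GROUP BY month, store
    ORDER BY store
""").fetchall()
print(rows)  # [('2024-01', 'NY', 220.0), ('2024-01', 'SF', 80.0)]
```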

- Lookup Table: The lookup table provides the detailed information about the attributes. For example, the lookup table for the Quarter attribute would include a list of all of the quarters available in the data warehouse. Each row (each quarter) may have several fields: one for the unique ID that identifies the quarter, and one or more additional fields that specify how that particular quarter is represented on a report (for example, the first quarter of 2001 may be represented as "Q1 2001" or "2001 Q1").

A dimensional model includes fact tables and lookup tables. Fact tables connect to one or more lookup tables, but fact tables do not have direct relationships to one another. Dimensions and hierarchies are represented by lookup tables. Attributes are the non-key columns in the lookup tables.

In designing data models for data warehouses / data marts, the most commonly used schema types are Star Schema and Snowflake Schema. Whether one uses a star or a snowflake largely depends on personal preference and business needs. Personally, I am partial to snowflakes when there is a business case to analyze the information at that particular level.

Fact And Fact Table Types

Types of Facts

There are three types of facts:

- Additive: Additive facts are facts that can be summed up through all of the dimensions in the fact table.
- Semi-Additive: Semi-additive facts are facts that can be summed up for some of the dimensions in the fact table, but not the others.
- Non-Additive: Non-additive facts are facts that cannot be summed up for any of the dimensions present in the fact table.

Let us use examples to illustrate each of the three types of facts. The first example assumes that we are a retailer, and we have a fact table with the following columns:

Date, Store, Product, Sales_Amount

The purpose of this table is to record the sales amount for each product in each store on a daily basis. Sales_Amount is the fact. In this case, Sales_Amount is an additive fact, because you can sum up this fact along any of the three dimensions present in the fact table -- date, store, and product. For example, the sum of Sales_Amount for all 7 days in a week represents the total sales amount for that week.

Say we are a bank with the following fact table:

Date, Account, Current_Balance, Profit_Margin

The purpose of this table is to record the current balance for each account at the end of each day, as well as the profit margin for each account for each day. Current_Balance and Profit_Margin are the facts. Current_Balance is a semi-additive fact: it makes sense to add it up across all accounts (what's the total current balance for all accounts in the bank?), but it does not make sense to add it up through time (adding up all current balances for a given account for each day of the month does not give us any useful information). Profit_Margin is a non-additive fact, for it does not make sense to add it up at either the account level or the day level.

Types of Fact Tables

Based on the above classifications, there are two types of fact tables:

- Cumulative: This type of fact table describes what has happened over a period of time. For example, this fact table may describe the total sales by product by store by day. The facts for this type of fact table are mostly additive facts. The first example presented here is a cumulative fact table.
- Snapshot: This type of fact table describes the state of things at a particular instant of time, and usually includes more semi-additive and non-additive facts. The second example presented here is a snapshot fact table.
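A quick sketch of the bank example above, showing why summing a semi-additive fact across accounts is meaningful while summing it across time is not. The balance values are made up.

```python
# (date, account) -> current balance at end of day
balances = {
    ("2024-01-01", "A"): 100.0, ("2024-01-01", "B"): 200.0,
    ("2024-01-02", "A"): 110.0, ("2024-01-02", "B"): 190.0,
}

# Summing across ACCOUNTS on one day is meaningful: the bank's total balance.
total_day1 = sum(v for (d, a), v in balances.items() if d == "2024-01-01")
print(total_day1)  # 300.0

# Summing across TIME for one account is not meaningful; a summary such as
# the average (or the latest value) is used instead.
acct_a = [v for (d, a), v in balances.items() if a == "A"]
meaningless_sum = sum(acct_a)            # 210.0 -- not a useful number
avg_balance = sum(acct_a) / len(acct_a)  # 105.0 -- a sensible summary
print(avg_balance)
```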

Data Warehouse Design

After the tools and team personnel selections are made, the data warehouse design can begin. The following are the typical steps involved in the data warehousing project cycle:

- Requirement Gathering
- Physical Environment Setup
- Data Modeling
- ETL
- OLAP Cube Design
- Front End Development
- Report Development
- Performance Tuning
- Query Optimization
- Quality Assurance
- Rolling out to Production
- Production Maintenance
- Incremental Enhancements

Each page listed above represents a typical data warehouse design phase, and has several sections:

- Task Description: This section describes what typically needs to be accomplished during this particular data warehouse design phase.
- Time Requirement: A rough estimate of the amount of time this particular data warehouse task takes.
- Deliverables: Typically at the end of each data warehouse task, one or more documents are produced that fully describe the steps and results of that particular task. This is especially important for consultants to communicate their results to the clients.
- Possible Pitfalls: Things to watch out for. Some of them are obvious, some of them not so obvious. All of them are real.

The Additional Observations section contains my own observations on data warehouse processes not included in any of the design steps.

Requirement Gathering

Task Description
The first thing that the project team should engage in is gathering requirements from end users. Because end users are typically not familiar with the data warehousing process or concepts, the help of the business sponsor is essential. Requirement gathering can happen as one-to-one meetings or as Joint Application Development (JAD) sessions, where multiple people talk about the project scope in the same meeting.

The primary goal of this phase is to identify what constitutes success for this particular phase of the data warehouse project. In particular, end user reporting / analysis requirements are identified, and the project team will spend the remaining period of time trying to satisfy these requirements.

Associated with the identification of user requirements is a more concrete definition of other details such as hardware sizing information, training requirements, data source identification, and, most importantly, a concrete project plan indicating the finishing date of the data warehousing project.

Based on the information gathered above, a disaster recovery plan needs to be developed so that the data warehousing system can recover from accidents that disable the system. Without an effective backup and restore strategy, the system will only last until the first major disaster, and, as many data warehousing DBAs will attest, this can happen very quickly after the project goes live.

Time Requirement
2 - 8 weeks.

Deliverables
A list of reports / cubes to be delivered to the end users by the end of this current phase. An updated project plan that clearly identifies resource loads and milestone delivery dates.

Possible Pitfalls
This phase often turns out to be the trickiest phase of the data warehousing implementation. The reason is that, because data warehousing by definition includes data from multiple sources spanning many different departments within the enterprise, there are often political battles that center on the willingness to share information. Even though a successful data warehouse benefits the enterprise, there are occasions where departments may not feel the same way. As a result of the unwillingness of certain groups to release data or to participate in the data warehousing requirement definition, the data warehouse effort either never gets off the ground or cannot start in the direction originally defined. When this happens, it is ideal to have a strong business sponsor. If the sponsor is at the CXO level, she can often exert enough influence to make sure everyone cooperates.

Data Modeling

Task Description
This is a very important step in the data warehousing project. Indeed, it is fair to say that the foundation of the data warehousing system is the data model. A good data model will allow the data warehousing system to grow easily, as well as allow for good performance.

In a data warehousing project, the logical data model is built based on user requirements, and then it is translated into the physical data model. The detailed steps can be found in the Conceptual, Logical, and Physical Data Modeling section.

Part of the data modeling exercise is often the identification of data sources. Sometimes this step is deferred until the ETL step. However, my feeling is that it is better to find out where the data exists, or, better yet, whether it even exists anywhere in the enterprise at all. Should the data not be available, this is a good time to raise the alarm. If this were delayed until the ETL phase, rectifying it would become a much tougher and more complex process.

Time Requirement
2 - 6 weeks.

Deliverables
Identification of data sources. Logical data model. Physical data model.

Possible Pitfalls
It is essential to have a subject-matter expert as part of the data modeling team. This person can be an outside consultant or can be someone in-house who has extensive experience in the industry. Without this person, it becomes difficult to get a definitive answer on many of the questions, and the entire project gets dragged out.

OLAP Cube Design

Task Description
Usually the design of the OLAP cube can be derived from the Requirement Gathering phase. More often than not, however, users have some idea of what they want, but it is difficult for them to specify the exact report / analysis they want to see. When this is the case, it is usually a good idea to include enough information so that they feel they have gained something through the data warehouse, but not so much that it stretches the data warehouse scope by a mile.

Remember that data warehousing is an iterative process - no one can ever meet all the requirements all at once.

Time Requirement
1 - 2 weeks.

Deliverables
Documentation specifying the OLAP cube dimensions and measures. Actual OLAP cube / report.

Possible Pitfalls
Make sure your OLAP cube-building process is optimized. It is common for the data warehouse to be at the bottom of the nightly batch load, and after the loading of the data warehouse, there usually isn't much time remaining for the OLAP cube to be refreshed. As a result, it is worthwhile to experiment with the OLAP cube generation paths to ensure optimal performance.

Front End Development

Task Description
Regardless of the strength of the OLAP engine and the integrity of the data, if the users cannot visualize the reports, the data warehouse brings zero value to them. Hence front end development is an important part of a data warehousing initiative.

So what are the things to look out for in selecting a front-end deployment methodology? The most important is that the reports be delivered over the web, so the only thing the user needs is a standard browser. These days it is neither desirable nor feasible to have the IT department install programs on end users' desktops just so that they can view reports. So, whatever strategy one pursues, the ability to deliver over the web is a must.

The front-end options range from internal front-end development using scripting languages such as ASP, PHP, or Perl, to off-the-shelf products such as Seagate Crystal Reports, to higher-end products such as Actuate. In addition, many OLAP vendors offer a front-end of their own. When choosing vendor tools, make sure they can be easily customized to suit the enterprise, especially the possible changes to the reporting requirements of the enterprise. Possible changes include not just differences in report layout and report content, but also possible changes in the back-end structure. For example, if the enterprise decides to change from Solaris/Oracle to Microsoft 2000/SQL Server, will the front-end tool be flexible enough to adjust to the changes without much modification?

Another area to be concerned with is the complexity of the reporting tool. For example, do the reports need to be published on a regular interval? Are there very specific formatting requirements? Is there a need for a GUI interface so that each user can customize her reports?

Time Requirement
1 - 4 weeks.

Deliverables
Front End Deployment Documentation

Possible Pitfalls
Just remember that the end users do not care how complex or how technologically advanced your front end infrastructure is. All they care about is that they receive their information in a timely manner and in the way they specified.

Report Development

Task Description
Report specification typically comes directly from the requirements phase. To the end user, the only direct touchpoint he or she has with the data warehousing system is the reports they see. So report development, although not as time consuming as some of the other steps such as ETL and data modeling, nevertheless plays a very important role in determining the success of the data warehousing project.

One would think that report development is an easy task. How hard can it be to just follow instructions to build the report? Unfortunately, this is not true. There are several points the data warehousing team needs to pay attention to before releasing the report.

- User customization: Do users need to be able to select their own metrics? And how do users need to be able to filter the information? The report development process needs to take those factors into consideration so that users can get the information they need in the shortest amount of time possible.
- Report delivery: What report delivery methods are needed? In addition to delivering the report to the web front end, other possibilities include delivery via email, via text messaging, or in some form of spreadsheet. There are reporting solutions in the marketplace that support report delivery as a flash file. Such a flash file essentially acts as a mini-cube, and allows end users to slice and dice the data on the report without having to pull data from an external source.
- Access privileges: Special attention needs to be paid to who has access to what information. A sales report can show 8 metrics covering the entire company to the company CEO, while the same report may only show 5 of the metrics covering only a single district to a District Sales Director.

Report development does not happen only during the implementation phase. After the system goes into production, there will certainly be requests for additional reports. These types of requests generally fall into two broad categories:

1. Data is already available in the data warehouse. In this case, it should be fairly straightforward to develop the new report into the front end. There is no need to wait for a major production push before making new reports available.
2. Data is not yet available in the data warehouse. This means that the request needs to be prioritized and put into a future data warehousing development cycle.
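The access-privileges point above can be sketched as a simple role-based filter on one report definition. The metric names, role names, and entitlements below are all hypothetical, chosen only to mirror the 8-metric / 5-metric example:

```python
# One report definition; what each user sees depends on their role.
REPORT_METRICS = ["revenue", "units", "margin", "returns", "discounts",
                  "pipeline", "forecast", "churn"]  # 8 metrics (made up)

ROLE_ACCESS = {
    # role -> (allowed metrics, allowed districts; None means all districts)
    "CEO": (REPORT_METRICS, None),
    "District Sales Director": (REPORT_METRICS[:5], ["District 7"]),
}

def visible_report(role, all_districts):
    """Return the slice of the report this role is entitled to see."""
    metrics, districts = ROLE_ACCESS[role]
    return {"metrics": metrics,
            "districts": all_districts if districts is None else districts}

districts = ["District 7", "District 8"]
print(visible_report("CEO", districts))                      # all 8 metrics, all districts
print(visible_report("District Sales Director", districts))  # 5 metrics, one district
```

In a real deployment this filtering would typically live in the reporting tool's security layer rather than in application code, but the principle is the same: one report, many entitled views.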

Time Requirement
1 - 2 weeks.

Deliverables
Report Specification Documentation. Reports set up in the front end / reports delivered to user's preferred channel.

Possible Pitfalls
Make sure the exact definitions of the report are communicated to the users. Otherwise, user interpretation of the report can be erroneous.

Performance Tuning

Task Description
There are three major areas where a data warehousing system can use a little performance tuning:

- ETL - Given that the data load is usually a very time-consuming process (and hence typically relegated to a nightly load job) and that data warehousing-related batch jobs are typically of lower priority, the window for data loading is not very long. A data warehousing system whose ETL process finishes right on time is going to have a lot of problems, simply because often the jobs do not get started on time due to factors beyond the control of the data warehousing team. As a result, it is always an excellent idea for the data warehousing group to tune the ETL process as much as possible.
- Query Processing - Sometimes, especially in a ROLAP environment or in a system where the reports are run directly against the relational database, query performance can be an issue. A study has shown that users typically lose interest after 30 seconds of waiting for a report to return. My experience has been that ROLAP reports or reports that run directly against the RDBMS often exceed this time limit, and it is hence ideal for the data warehousing team to invest some time to tune the queries, especially the most popular ones. We present a number of query optimization ideas.
- Report Delivery - It is also possible that end users are experiencing significant delays in receiving their reports due to factors other than query performance. For example, network traffic, server setup, and even the way the front-end was built sometimes play significant roles. It is important for the data warehouse team to look into these areas for performance tuning.

Time Requirement
3 - 5 days.

Deliverables
Performance tuning document - Goal and Result

Possible Pitfalls
Make sure the development environment mimics the production environment as much as possible - performance enhancements seen on less powerful machines sometimes do not materialize on the larger, production-level machines.

Query Optimization

For any production database, SQL query performance becomes an issue sooner or later. Long-running queries not only consume system resources that make the server and application run slowly, but may also lead to table locking and data corruption issues. So query optimization becomes an important task.

First, we offer some guiding principles for query optimization:

1. Understand how your database is executing your query

Nowadays all databases have their own query optimizer, and offer a way for users to understand how a query is executed. For example, which index from which table is being used to execute the query? The first step to query optimization is understanding what the database is doing. Different databases have different commands for this. For example, in MySQL, one can use "EXPLAIN [SQL Query]" to see the query plan. In Oracle, one can use "EXPLAIN PLAN FOR [SQL Query]" to see the query plan.

2. Retrieve as little data as possible

The more data returned from the query, the more resources the database needs to expend to process and store these data. So, for example, if you only need to retrieve one column from a table, do not use 'SELECT *'.

3. Store intermediate results

Sometimes the logic for a query can be quite complex. Often, it is possible to achieve the desired result through the use of subqueries, inline views, and UNION-type statements. In those cases, the intermediate results are not stored in the database, but are immediately used within the query. This can lead to performance issues, especially when the intermediate results have a large number of rows. The way to increase query performance in those cases is to store the intermediate results in a temporary table, and break up the initial SQL statement into several SQL statements. In many cases, you can even build an index on the temporary table to speed up the query performance even more. Granted, this adds a little complexity in query management (i.e., the need to manage temporary tables), but the speedup in query performance is often worth the trouble.

Below are several specific query optimization strategies.

Use Index
Using an index is the first strategy one should use to speed up a query. In fact, this strategy is so important that index optimization is also discussed separately.
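SQLite offers the same facility through EXPLAIN QUERY PLAN, which makes both the "understand your query" principle and the index strategy easy to demonstrate from Python. The table and index names are illustrative, and the exact plan wording varies by SQLite version:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE sales (sale_date TEXT, store TEXT, amount REAL)")

# Without an index, the plan is a full table scan; the human-readable
# detail is the last column of each plan row.
plan_before = cur.execute(
    "EXPLAIN QUERY PLAN SELECT amount FROM sales WHERE store = 'NY'"
).fetchall()[0][3]
print(plan_before)  # e.g. 'SCAN sales'

# After indexing the filtered column, the optimizer switches to the index.
cur.execute("CREATE INDEX idx_sales_store ON sales (store)")
plan_after = cur.execute(
    "EXPLAIN QUERY PLAN SELECT amount FROM sales WHERE store = 'NY'"
).fetchall()[0][3]
print(plan_after)  # e.g. 'SEARCH sales USING INDEX idx_sales_store (store=?)'
```

Reading the plan before and after a change is exactly the feedback loop described in principle 1: you do not guess whether an index helped, you ask the optimizer.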
Aggregate Table
Pre-populate tables at higher levels so that less data needs to be parsed.

Vertical Partitioning
Partition the table by columns. This strategy decreases the amount of data a SQL query needs to process.

Horizontal Partitioning
Partition the table by data value, most often time. This strategy decreases the amount of data a SQL query needs to process.
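The aggregate table strategy above can be sketched as follows; in practice the aggregate would be refreshed during the nightly batch load, and all names here are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE fct_sales (sale_date TEXT, store TEXT, amount REAL)")
cur.executemany("INSERT INTO fct_sales VALUES (?, ?, ?)", [
    ("2024-01-01", "NY", 10.0), ("2024-01-01", "NY", 15.0),
    ("2024-01-01", "SF", 20.0), ("2024-01-02", "NY", 30.0),
])

# Pre-populate the aggregate once, at day/store grain.
cur.execute("""
    CREATE TABLE agg_sales_daily AS
    SELECT sale_date, store, SUM(amount) AS amount
    FROM fct_sales
    GROUP BY sale_date, store
""")

# Reports at that grain now read the small aggregate instead of scanning
# and re-summing the (in real systems, much larger) fact table.
rows = cur.execute(
    "SELECT amount FROM agg_sales_daily "
    "WHERE sale_date = '2024-01-01' AND store = 'NY'"
).fetchall()
print(rows)  # [(25.0,)]
```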

Denormalization
The process of denormalization combines multiple tables into a single table. This speeds up query performance because fewer table joins are needed.

Server Tuning
Each server has its own parameters, and often tuning server parameters so that the server can fully take advantage of the hardware resources can significantly speed up query performance.
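A minimal sketch of the denormalization strategy: fold a lookup table's attribute directly into the fact table so the report query needs no join. Names and values are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Normalized form: region lives in a lookup table, joined at query time.
cur.execute("CREATE TABLE lkp_store (store_id INTEGER PRIMARY KEY, region TEXT)")
cur.execute("INSERT INTO lkp_store VALUES (1, 'East'), (2, 'West')")
cur.execute("CREATE TABLE fct_sales (store_id INTEGER, amount REAL)")
cur.executemany("INSERT INTO fct_sales VALUES (?, ?)",
                [(1, 100.0), (2, 50.0), (1, 25.0)])

# Denormalized copy: region is stored on each fact row (built once, e.g.
# during the batch load), trading storage for join-free reads.
cur.execute("""
    CREATE TABLE fct_sales_denorm AS
    SELECT s.region, f.amount
    FROM fct_sales f JOIN lkp_store s ON f.store_id = s.store_id
""")

# The report query is now a single-table scan with no join.
rows = cur.execute("""
    SELECT region, SUM(amount) FROM fct_sales_denorm
    GROUP BY region ORDER BY region
""").fetchall()
print(rows)  # [('East', 125.0), ('West', 50.0)]
```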
