
Different people have different definitions for a data warehouse.

The most popular definition came from Bill Inmon, who provided the following:

A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision making process.

Subject-Oriented: A data warehouse can be used to analyze a particular subject area. For example,
"sales" can be a particular subject.

Integrated: A data warehouse integrates data from multiple data sources. For example, source A and
source B may have different ways of identifying a product, but in a data warehouse, there will be only a
single way of identifying a product.

Time-Variant: Historical data is kept in a data warehouse. For example, one can retrieve data from 3 months, 6 months, 12 months, or even older from a data warehouse. This contrasts with a transaction system, where often only the most recent data is kept. For example, a transaction system may hold only the most recent address of a customer, whereas a data warehouse can hold all addresses associated with a customer.

Non-volatile: Once data is in the data warehouse, it will not change. So, historical data in a data
warehouse should never be altered.

Ralph Kimball provided a more concise definition of a data warehouse:

A data warehouse is a copy of transaction data specifically structured for query and analysis.

This is a functional view of a data warehouse. Unlike Inmon, Kimball did not address how the data warehouse is built; rather, he focused on the functionality of a data warehouse.

Different data warehousing systems have different structures. Some may have an ODS (operational data
store), while some may have multiple data marts. Some may have a small number of data sources, while
some may have dozens of data sources. In view of this, it is far more reasonable to present the different
layers of a data warehouse architecture rather than discussing the specifics of any one system.

In general, all data warehouse systems have the following layers:

• Data Source Layer
• Data Extraction Layer
• Staging Area
• ETL Layer
• Data Storage Layer
• Data Logic Layer
• Data Presentation Layer
• Metadata Layer
• System Operations Layer

The picture below shows the relationships among the different components of the data warehouse architecture.

Each component is discussed individually below:

Data Source Layer

This represents the different data sources that feed data into the data warehouse. The data source can be of any format: a plain text file, a relational database, another type of database, an Excel file, and so on can all act as a data source.

Many different types of data can be a data source:

 Operations -- such as sales data, HR data, product data, inventory data, marketing data, systems data.
 Web server logs with user browsing data.
 Internal market research data.
 Third-party data, such as census data, demographics data, or survey data.

All these data sources together form the Data Source Layer.

Data Extraction Layer

Data gets pulled from the data source into the data warehouse system. There is likely some minimal data cleansing, but there is unlikely to be any major data transformation.

Staging Area

This is where data sits prior to being scrubbed and transformed into a data warehouse / data mart. Having
one common area makes it easier for subsequent data processing / integration.

ETL Layer

This is where data gains its "intelligence", as logic is applied to transform the data from a transactional
nature to an analytical nature. This layer is also where data cleansing happens.

Data Storage Layer


This is where the transformed and cleansed data sit. Based on scope and functionality, 3 types of entities
can be found here: data warehouse, data mart, and operational data store (ODS). In any given system, you
may have just one of the three, two of the three, or all three types.

Data Logic Layer

This is where business rules are stored. Business rules stored here do not affect the underlying data transformation rules, but they do affect what the report looks like.

Data Presentation Layer

This refers to the information that reaches the users. It can be in the form of a tabular / graphical report in a browser, an emailed report that is automatically generated and sent every day, or an alert that warns users of exceptions, among others.

Metadata Layer

This is where information about the data stored in the data warehouse system is stored. A logical data
model would be an example of something that's in the metadata layer.

System Operations Layer

This layer includes information on how the data warehouse system operates, such as ETL job status,
system performance, and user access history.

After the tools and team personnel selections are made, the data warehouse design can begin. The following are the typical steps involved in the data warehousing project cycle.

• Requirement Gathering

Task Description

The first thing that the project team should engage in is gathering requirements from end users. Because
end users are typically not familiar with the data warehousing process or concept, the help of the business
sponsor is essential. Requirement gathering can happen as one-to-one meetings or as Joint Application
Development (JAD) sessions, where multiple people are talking about the project scope in the same
meeting.

The primary goal of this phase is to identify what constitutes success for this particular phase of the data warehouse project. In particular, end user reporting / analysis requirements are identified, and the project team will spend the remaining period of time trying to satisfy these requirements.

Associated with the identification of user requirements is a more concrete definition of other details such as
hardware sizing information, training requirements, data source identification, and most importantly, a
concrete project plan indicating the finishing date of the data warehousing project.

Based on the information gathered above, a disaster recovery plan needs to be developed so that the data warehousing system can recover from accidents that disable the system. Without an effective backup and restore strategy, the system will only last until the first major disaster, and, as many data warehousing DBAs will attest, this can happen very quickly after the project goes live.
Time Requirement

2 - 8 weeks.

Deliverables

• A list of reports / cubes to be delivered to the end users by the end of this current phase.
• An updated project plan that clearly identifies resource loads and milestone delivery dates.

Possible Pitfalls

This phase often turns out to be the trickiest phase of the data warehousing implementation. The reason is that, because data warehousing by definition includes data from multiple sources spanning many different departments within the enterprise, there are often political battles that center on the willingness to share information. Even though a successful data warehouse benefits the enterprise, there are occasions where departments may not feel the same way. As a result of the unwillingness of certain groups to release data or to participate in the data warehousing requirement definition, the data warehouse effort either never gets off the ground or cannot proceed in the direction originally defined.

When this happens, it would be ideal to have a strong business sponsor. If the sponsor is at the CXO level,
she can often exert enough influence to make sure everyone cooperates.

• Physical Environment Setup

Task Description

Once the requirements are somewhat clear, it is necessary to set up the physical servers and databases.
At a minimum, it is necessary to set up a development environment and a production environment. There
are also many data warehousing projects where there are three environments: Development, Testing, and
Production.

It is not enough to simply have different physical environments set up. The different processes (such as
ETL, OLAP Cube, and reporting) also need to be set up properly for each environment.

It is best for the different environments to use distinct application and database servers. In other words, the
development environment will have its own application server and database servers, and the production
environment will have its own set of application and database servers.

Having different environments is very important for the following reasons:

• All changes can be tested and QA'd first without affecting the production environment.
• Development and QA can occur during the time users are accessing the data warehouse.
• When there is any question about the data, having separate environment(s) will allow the data
warehousing team to examine the data without impacting the production environment.

Time Requirement

Getting the servers and databases ready should take less than 1 week.
Deliverables

• Hardware / Software setup document for all of the environments, including hardware specifications,
and scripts / settings for the software.

Possible Pitfalls

To save on capital, often data warehousing teams will decide to use only a single database and a single
server for the different environments. Environment separation is achieved by either a directory structure or
setting up distinct instances of the database. This is problematic for the following reasons:

1. Sometimes the server needs to be rebooted for the development environment. Having a separate development environment prevents the production environment from being impacted by this.

2. There may be interference when different database environments share a single box. For example, multiple long queries running on the development database could affect the performance of the production database.

• Data Modeling

Task Description

This is a very important step in the data warehousing project. Indeed, it is fair to say that the foundation of
the data warehousing system is the data model. A good data model will allow the data warehousing system
to grow easily, as well as allowing for good performance.

In a data warehousing project, the logical data model is built based on user requirements, and then it is translated into the physical data model. The detailed steps can be found in the Conceptual, Logical, and Physical Data Modeling section.

Part of the data modeling exercise is often the identification of data sources. Sometimes this step is deferred until the ETL step. However, my feeling is that it is better to find out where the data exists, or, better yet, whether the data even exists anywhere in the enterprise at all. Should the data not be available, this is a good time to raise the alarm. If this is delayed until the ETL phase, rectifying it becomes a much tougher and more complex process.

Time Requirement

2 - 6 weeks.

Deliverables

• Identification of data sources.


• Logical data model.
• Physical data model.

Possible Pitfalls
It is essential to have a subject-matter expert as part of the data modeling team. This person can be an
outside consultant or can be someone in-house who has extensive experience in the industry. Without this
person, it becomes difficult to get a definitive answer on many of the questions, and the entire project gets
dragged out.

• ETL

Task Description

The ETL (Extraction, Transformation, Loading) process typically takes the longest to develop, and this can
easily take up to 50% of the data warehouse implementation cycle or longer. The reason for this is that it
takes time to get the source data, understand the necessary columns, understand the business rules, and
understand the logical and physical data models.

Time Requirement

1 - 6 weeks.

Deliverables

• Data Mapping Document


• ETL Script / ETL Package in the ETL tool

Possible Pitfalls

There is a tendency to give this particular phase too little development time. This can prove fatal to the project, because end users will usually tolerate less formatting, longer report run times, less functionality (slicing and dicing), or fewer delivered reports; the one thing they will not tolerate is wrong information.

A second common problem is that some people make the ETL process more complicated than necessary. In ETL design, the primary goal should be to optimize load speed without sacrificing quality. This is,
however, sometimes not followed. There are cases where the design goal is to cover all possible future
uses, whether they are practical or just a figment of someone's imagination. When this happens, ETL
performance suffers, and often so does the performance of the entire data warehousing system.

• OLAP Cube Design

Task Description

Usually the design of the OLAP cube can be derived from the Requirement Gathering phase. More often than not, however, users have some idea of what they want, but it is difficult for them to specify the exact report / analysis they want to see. When this is the case, it is usually a good idea to include enough information so that they feel they have gained something through the data warehouse, but not so much that it stretches the data warehouse scope by a mile. Remember that data warehousing is an iterative process - no one can ever meet all the requirements all at once.
Time Requirement

1 - 2 weeks.

Deliverables

• Documentation specifying the OLAP cube dimensions and measures.


• Actual OLAP cube / report.

Possible Pitfalls

Make sure your OLAP cube-building process is optimized. It is common for the data warehouse to be at the bottom of the nightly batch load, and after the loading of the data warehouse, there usually isn't much time remaining for the OLAP cube to be refreshed. As a result, it is worthwhile to experiment with the OLAP cube generation paths to ensure optimal performance.

• Front End Development

Task Description

Regardless of the strength of the OLAP engine and the integrity of the data, if the users cannot visualize
the reports, the data warehouse brings zero value to them. Hence front end development is an important
part of a data warehousing initiative.

So what are the things to look out for in selecting a front-end deployment methodology? The most important thing is that the reports should be delivered over the web, so that the only thing the user needs is a standard browser. These days it is neither desirable nor feasible to have the IT department installing programs on end users' desktops just so that they can view reports. So, whatever strategy one pursues, the ability to deliver over the web is a must.

The front-end options range from internal front-end development using scripting languages such as ASP, PHP, or Perl, to off-the-shelf products such as Seagate Crystal Reports, to higher-end products such as Actuate. In addition, many OLAP vendors offer a front end of their own. When choosing vendor tools, make sure they can be easily customized to suit the enterprise, especially with respect to possible changes in the reporting requirements of the enterprise. Possible changes include not just differences in report layout and report content, but also possible changes in the back-end structure. For example, if the enterprise decides to change from Solaris/Oracle to Microsoft 2000/SQL Server, will the front-end tool be flexible enough to adjust to the changes without much modification?

Another area to be concerned with is the complexity of the reporting tool. For example, do the reports need to be published at a regular interval? Are there very specific formatting requirements? Is there a need for a GUI so that each user can customize her reports?

Time Requirement

1 - 4 weeks.
Deliverables

Front End Deployment Documentation

Possible Pitfalls

Just remember that the end users do not care how complex or how technologically advanced your front-end infrastructure is. All they care about is that they receive their information in a timely manner and in the way they specified.

• Report Development

Task Description

Report specification typically comes directly from the requirements phase. To end users, the only direct touchpoint they have with the data warehousing system is the reports they see. So report development, although not as time consuming as some of the other steps such as ETL and data modeling, nevertheless plays a very important role in determining the success of the data warehousing project.

One would think that report development is an easy task. How hard can it be to just follow instructions to build the report? Unfortunately, this is not true. There are several points the data warehousing team needs to pay attention to before releasing the report.

User customization: Do users need to be able to select their own metrics? And how do users need to be
able to filter the information? The report development process needs to take those factors into
consideration so that users can get the information they need in the shortest amount of time possible.

Report delivery: What report delivery methods are needed? In addition to delivering the report to the web
front end, other possibilities include delivery via email, via text messaging, or in some form of spreadsheet.
There are reporting solutions in the marketplace that support report delivery as a flash file. Such a flash file essentially acts as a mini-cube, and allows end users to slice and dice the data on the report without having to pull data from an external source.

Access privileges: Special attention needs to be paid to who has what access to what information. A
sales report can show 8 metrics covering the entire company to the company CEO, while the same report
may only show 5 of the metrics covering only a single district to a District Sales Director.

Report development does not happen only during the implementation phase. After the system goes into
production, there will certainly be requests for additional reports. These types of requests generally fall into
two broad categories:

1. Data is already available in the data warehouse. In this case, it should be fairly straightforward to
develop the new report into the front end. There is no need to wait for a major production push before
making new reports available.

2. Data is not yet available in the data warehouse. This means that the request needs to be prioritized and
put into a future data warehousing development cycle.
Time Requirement

1 - 2 weeks.

Deliverables

• Report Specification Documentation.


• Reports set up in the front end / reports delivered to user's preferred channel.

Possible Pitfalls

Make sure the exact definitions of the report are communicated to the users. Otherwise, user interpretation of the report can be erroneous.

• Performance Tuning

Task Description

There are three major areas where a data warehousing system can use a little performance tuning:

• ETL - Given that the data load is usually a very time-consuming process (and hence is typically relegated to a nightly load job) and that data warehousing-related batch jobs are typically of lower priority, the window for data loading is not very long. A data warehousing system whose ETL process finishes right on time is going to have a lot of problems, simply because the jobs often do not get started on time due to factors that are beyond the control of the data warehousing team. As a result, it is always an excellent idea for the data warehousing group to tune the ETL process as much as possible.
• Query Processing - Sometimes, especially in a ROLAP environment or in a system where the reports are run directly against the relational database, query performance can be an issue. A study has shown that users typically lose interest after 30 seconds of waiting for a report to return. My experience has been that ROLAP reports or reports that run directly against the RDBMS often exceed this time limit, and it is hence ideal for the data warehousing team to invest some time to tune the queries, especially the most popular ones. We present a number of query optimization ideas below.
• Report Delivery - It is also possible that end users are experiencing significant delays in receiving
their reports due to factors other than the query performance. For example, network traffic, server
setup, and even the way that the front-end was built sometimes play significant roles. It is important
for the data warehouse team to look into these areas for performance tuning.

Time Requirement

3 - 5 days.

Deliverables

• Performance tuning document - Goal and Result


Possible Pitfalls

Make sure the development environment mimics the production environment as much as possible -
Performance enhancements seen on less powerful machines sometimes do not materialize on the larger,
production-level machines.

• Query Optimization

For any production database, SQL query performance becomes an issue sooner or later. Long-running queries not only consume system resources, making the server and application run slowly, but may also lead to table locking and data corruption issues. So, query optimization becomes an important task.

First, we offer some guiding principles for query optimization:

1. Understand how your database is executing your query

Nowadays all databases have their own query optimizer and offer a way for users to understand how a query is executed. For example, which index from which table is being used to execute the query? The first step to query optimization is understanding what the database is doing. Different databases have different commands for this. For example, in MySQL, one can prefix a query with the EXPLAIN keyword ("EXPLAIN [SQL Query]") to see the query plan. In Oracle, one can use "EXPLAIN PLAN FOR [SQL Query]" to see the query plan.
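
As a minimal illustration (the fact table and column names here are hypothetical), the same example query can be examined in both databases:

    -- MySQL: prefix the query with EXPLAIN to display the query plan,
    -- including which index (if any) is used for each table.
    EXPLAIN
    SELECT store_id, SUM(sales_amount)
    FROM sales_fact
    WHERE sales_date >= '2003-01-01'
    GROUP BY store_id;

    -- Oracle: EXPLAIN PLAN FOR stores the plan, which can then be displayed.
    EXPLAIN PLAN FOR
    SELECT store_id, SUM(sales_amount)
    FROM sales_fact
    WHERE sales_date >= '2003-01-01'
    GROUP BY store_id;

    SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);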

2. Retrieve as little data as possible

The more data returned from the query, the more resources the database needs to expend to process and store that data. So, for example, if you only need to retrieve one column from a table, do not use 'SELECT *'.
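
A small sketch of the idea, using hypothetical table and column names:

    -- Ask only for the column and rows actually needed, instead of SELECT *.
    SELECT sales_amount
    FROM sales_fact
    WHERE sales_date = '2003-01-15';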

3. Store intermediate results

Sometimes logic for a query can be quite complex. Often, it is possible to achieve the desired result
through the use of subqueries, inline views, and UNION-type statements. For those cases, the
intermediate results are not stored in the database, but are immediately used within the query. This can
lead to performance issues, especially when the intermediate results have a large number of rows.

The way to increase query performance in those cases is to store the intermediate results in a temporary
table, and break up the initial SQL statement into several SQL statements. In many cases, you can even
build an index on the temporary table to speed up the query performance even more. Granted, this adds a
little complexity in query management (i.e., the need to manage temporary tables), but the speedup in
query performance is often worth the trouble.
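
A sketch of this approach, using hypothetical names (temporary-table syntax varies slightly between databases; MySQL syntax is shown):

    -- Step 1: materialize the intermediate result in a temporary table.
    CREATE TEMPORARY TABLE top_customers AS
    SELECT customer_id, SUM(sales_amount) AS total_sales
    FROM sales_fact
    WHERE sales_date >= '2003-01-01'
    GROUP BY customer_id
    HAVING SUM(sales_amount) > 100000;

    -- Step 2: optionally index the temporary table to speed up the join.
    CREATE INDEX idx_top_customers ON top_customers (customer_id);

    -- Step 3: use the intermediate result in a much simpler final query.
    SELECT c.customer_name, t.total_sales
    FROM top_customers t
    JOIN customer_lookup c ON c.customer_id = t.customer_id;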

Below are several specific query optimization strategies.

• Use Index
Using an index is the first strategy one should use to speed up a query. In fact, this strategy is so
important that index optimization is also discussed.
• Aggregate Table
Pre-populate tables at higher levels of summarization so that less data needs to be parsed (a small sketch follows this list).
• Vertical Partitioning
Partition the table by columns. This strategy decreases the amount of data a SQL query needs to
process.
• Horizontal Partitioning
Partition the table by data value, most often time. This strategy decreases the amount of data a SQL
query needs to process.
• Denormalization
The process of denormalization combines multiple tables into a single table. This speeds up query
performance because fewer table joins are needed.
• Server Tuning
Each server has its own parameters, and often tuning server parameters so that it can fully take
advantage of the hardware resources can significantly speed up query performance.
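
As a sketch of the aggregate table strategy (table and column names are hypothetical; MySQL date functions shown), daily rows are pre-summarized to the monthly level so that monthly reports scan far fewer rows:

    -- Pre-populate a monthly aggregate from the daily fact table.
    CREATE TABLE sales_fact_monthly AS
    SELECT store_id,
           product_id,
           DATE_FORMAT(sales_date, '%Y-%m') AS sales_month,
           SUM(sales_amount) AS sales_amount
    FROM sales_fact
    GROUP BY store_id, product_id, DATE_FORMAT(sales_date, '%Y-%m');

    -- Monthly reports then read the much smaller aggregate table directly.
    SELECT sales_month, SUM(sales_amount) AS monthly_sales
    FROM sales_fact_monthly
    GROUP BY sales_month;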

• Quality Assurance

Task Description

Once the development team declares that everything is ready for further testing, the QA team takes over. The QA team always comes from the client. Usually the QA team members will know little about data warehousing, and some of them may even resent the need to learn another tool or tools. This makes the QA process a tricky one.

Sometimes the QA process is overlooked. On my very first data warehousing project, the project team worked very hard to get everything ready for Phase 1, and everyone thought that we had met the deadline. There was one mistake, though: the project managers failed to recognize that it was necessary to go through the client QA process before the project could go into production. As a result, it took five extra months to bring the project to production (the original development time had been only 2 1/2 months).

Time Requirement

1 - 4 weeks.

Deliverables

• QA Test Plan
• QA verification that the data warehousing system is ready to go to production

Possible Pitfalls

As mentioned above, usually the QA team members know little about data warehousing, and some of them may even resent the need to learn another tool or tools. Make sure the QA team members get enough education so that they can complete the testing themselves.
• Rolling out to Production

Task Description

Once the QA team gives the thumbs up, it is time for the data warehouse system to go live. Some may think this is as easy as flipping a switch, but usually it is not true. Depending on the number of end users, it sometimes takes up to a full week to bring everyone online! Fortunately, nowadays most end users access the data warehouse over the web, making going into production sometimes as easy as sending out a URL via email.

Time Requirement

1 - 3 days.

Deliverables

• Delivery of the data warehousing system to the end users.

Possible Pitfalls

Take care to address the user education needs. There is nothing more frustrating than to spend several months developing and QA'ing the data warehousing system, only to see little usage because the users are not properly trained. Regardless of how intuitive or easy the interface may be, it is always a good idea to send the users to at least a one-day course to let them understand what they can achieve by properly using the data warehouse.

• Production Maintenance

Task Description

Once the data warehouse goes into production, it needs to be maintained. Tasks such as regular backup and crisis management become important and should be planned out. In addition, it is very important to consistently monitor end user usage. This serves two purposes: 1. To capture any runaway requests so that they can be fixed before slowing the entire system down, and 2. To understand how much the users are utilizing the data warehouse, for return-on-investment calculations and future enhancement considerations.

Time Requirement

Ongoing.

Deliverables

Consistent availability of the data warehousing system to the end users.

Possible Pitfalls

Usually by this time most, if not all, of the developers will have left the project, so it is essential that proper documentation is left for those who are handling production maintenance. There is nothing more frustrating than staring at something another person did, yet being unable to figure it out due to the lack of proper documentation.

Another pitfall is that the maintenance phase is usually boring. So, if there is another phase of the data
warehouse planned, start on that as soon as possible.

• Incremental Enhancements

Task Description

Once the data warehousing system goes live, there are often needs for incremental enhancements. I am not talking about a new data warehousing phase, but simply small changes that follow the business itself. For example, the original geographical designations may change: the company may originally have had 4 sales regions, but because sales are going so well, it now has 10 sales regions.

Deliverables

• Change management documentation


• Actual change to the data warehousing system

Possible Pitfalls

Because a lot of the time the changes are simple to make, it is very tempting to just go ahead and make the change in production. This is a definite no-no. Many unexpected problems will pop up if this is done. I would very strongly recommend that the typical cycle of development --> QA --> Production be followed, regardless of how simple the change may seem.

Additional Observations

Quick implementation time

If you add up the total time required to complete the tasks from Requirement Gathering to Rollout to Production, you'll find it takes about 9 - 29 weeks to complete each phase of the data warehousing effort. The 9 weeks may sound too quick, but I have been personally involved in a turnkey data warehousing implementation that took 40 business days, so it is entirely possible. Furthermore, some of the tasks may proceed in parallel, so as a rule of thumb it is reasonable to say that it generally takes 2 - 6 months for each phase of the data warehousing implementation.

Why is this important? The main reason is that in today's business world, the business environment changes quickly, which means that what is important now may not be important 6 months from now. For example, even the traditionally static financial industry is coming up with new products and new ways to generate revenue at a rapid pace. A time-consuming data warehousing effort will therefore very likely become obsolete by the time it is in production. It is best to finish a project quickly. The focus on quick delivery time does mean, however, that the scope for each phase of the data warehousing project will necessarily be limited. In this case, the 80-20 rule applies, and our goal is to do the 20% of the work that will satisfy 80% of the user needs. The rest can come later.
Lack of collaboration with data mining efforts

Usually data mining is viewed as the final manifestation of the data warehouse. The ideal is that, now that information from all over the enterprise is conformed and stored in a central location, data mining techniques can be applied to find relationships that would otherwise be impossible to find. Unfortunately, this has not quite happened, due to the following reasons:

1. Few enterprises have an enterprise data warehouse infrastructure. In fact, currently they are more likely
to have isolated data marts. At the data mart level, it is difficult to come up with relationships that cannot be
answered by a good OLAP tool.

2. The ROI for data mining companies is inherently lower because, by definition, data mining will only be performed by a few users (generally no more than 5) in the entire enterprise. As a result, it is hard to charge a lot of money due to the low number of users. In addition, developing data mining algorithms is an inherently complex process and requires a lot of up-front investment. Finally, it is difficult for the vendor to put a value proposition in front of the client because quantifying the returns on a data mining project is next to impossible.

This is not to say, however, that data mining is not being utilized by enterprises. In fact, many enterprises
have made excellent discoveries using data mining techniques. What I am saying, though, is that data
mining is typically not associated with a data warehousing initiative. It seems like successful data mining
projects are usually stand-alone projects.

Industry consolidation

In the last several years, we have seen rapid industry consolidation, as the weaker competitors are
gobbled up by stronger players. The most significant transactions are below (note that the dollar amount
quoted is the value of the deal when initially announced):

• IBM purchased Cognos for $5 billion in 2007.


• SAP purchased Business Objects for $6.8 billion in 2007.
• Oracle purchased Hyperion for $3.3 billion in 2007.
• Business Objects (OLAP/ETL) purchased FirstLogic (data cleansing) for $69 million in 2006.
• Informatica (ETL) purchased Similarity Systems (data cleansing) for $55 million in 2006.
• IBM (database) purchased Ascential Software (ETL) for $1.1 billion in cash in 2005.
• Business Objects (OLAP) purchased Crystal Decisions (Reporting) for $820 million in 2003.
• Hyperion (OLAP) purchased Brio (OLAP) for $142 million in 2003.
• GEAC (ERP) purchased Comshare (OLAP) for $52 million in 2003.

For the majority of the deals, the purchase represents an effort by the buyer to expand into other areas of data warehousing (Hyperion's purchase of Brio also falls into this category because, even though both are OLAP vendors, their product lines do not overlap). This clearly shows vendors' strong push to be the one-stop shop, from reporting and OLAP to ETL.

There are two levels of one-stop shop. The first level is at the corporate level. In this case, the vendor is
essentially still selling two entirely separate products. But instead of dealing with two sets of sales and
technology support groups, the customers only interact with one such group. The second level is at the
product level. In this case, different products are integrated. In data warehousing, this essentially means
that they share the same metadata layer. This is actually a rather difficult task, and therefore not commonly
accomplished. When there is metadata integration, the customers not only get the benefit of dealing with only one vendor instead of two (or more), but they will also be using a single product rather than multiple products. This is where the real value of industry consolidation shows.

How to measure success

Given the significant amount of resources usually invested in a data warehousing project, a very important
question is how success can be measured. This is a question that many project managers do not think
about, and for good reason: Many project managers are brought in to build the data warehousing system,
and then turn it over to in-house staff for ongoing maintenance. The job of the project manager is to build
the system, not to justify its existence.

Just because this is often not done does not mean this is not important. Just like a data warehousing
system aims to measure the pulse of the company, the success of the data warehousing system itself
needs to be measured. Without some type of measure on the return on investment (ROI), how does the
company know whether it made the right choice? Whether it should continue with the data warehousing
investment?

There are a number of papers out there that provide formulas for calculating the return on a data warehousing investment. Some of the calculations become quite cumbersome, with a number of assumptions and even more variables. Although they are all valid methods, I believe the success of the data warehousing system can simply be measured by looking at one criterion:

How often the system is being used.

If the system is satisfying user needs, users will naturally use the system. If not, users will abandon the system, and a data warehousing system with no users is actually a detriment to the company (since resources that could be deployed elsewhere are required to maintain the system). Therefore, it is very important to have a tracking mechanism to figure out how much the users are accessing the data warehouse. This should not be a problem if third-party reporting/OLAP tools are used, since they all contain this component. If the reporting tool is built from scratch, this feature needs to be included in the tool. Once the system goes into production, the data warehousing team needs to periodically check to make sure users are using the system. If usage starts to dip, find out why and address the reason as soon as possible. Is the data quality lacking? Are the reports not satisfying current needs? Is the response time slow? Whatever the reason, take steps to address it as soon as possible, so that the data warehousing system continues to serve its purpose.
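
For example, if the front end writes each report run to an access log (the table and column names below are hypothetical and depend on the tool in use), usage can be tracked with a simple query:

    -- Monthly report runs and distinct users, from a hypothetical access log.
    SELECT DATE_FORMAT(access_time, '%Y-%m') AS month,
           COUNT(*)                AS report_runs,
           COUNT(DISTINCT user_id) AS distinct_users
    FROM report_access_log
    GROUP BY DATE_FORMAT(access_time, '%Y-%m')
    ORDER BY month;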

Recipes for data warehousing project failure


BUSINESS INTELLIGENCE

Business intelligence is a term commonly associated with data warehousing. In fact, many of the tool vendors position their products as business intelligence software rather than data warehousing software. There are other occasions where the two terms are used interchangeably. So, exactly what is business intelligence?

Business intelligence usually refers to the information that is available for the enterprise to make decisions on. A data warehousing (or data mart) system is the backend, or infrastructural, component for achieving business intelligence. Business intelligence also includes the insight gained from doing data mining analysis, as well as from unstructured data (thus the need for content management systems). For our purposes here, we will discuss business intelligence in the context of using a data warehouse infrastructure.

Business intelligence tools: Tools commonly used for business intelligence.

The most common tools used for business intelligence are listed below, in order of increasing cost, increasing functionality, increasing business intelligence complexity, and decreasing number of total users.

Excel
Take a guess: what is the most common business intelligence tool? You might be surprised to find out it's Microsoft Excel. There are several reasons for this:

1. It's relatively cheap.

2. It's commonly used. You can easily send an Excel sheet to another person without worrying whether the
recipient knows how to read the numbers.

3. It has most of the functionalities users need to display data.

In fact, it is still so popular that all third-party reporting / OLAP tools have an "export to Excel" functionality.
Even for home-built solutions, the ability to export numbers to Excel usually needs to be built.

Excel is best used for business operations reporting and goals tracking.

Reporting tool

In this discussion, I am including both custom-built reporting tools and commercial reporting tools. They provide some flexibility in terms of the ability for each user to create, schedule, and run their own reports. The Reporting Tool Selection section discusses how one should select a reporting tool.

Business operations reporting and dashboard are the most common applications for a reporting tool.

OLAP tool

OLAP tools are usually used by advanced users. They make it easy for users to look at the data from multiple dimensions. The OLAP Tool Selection section discusses how one should select an OLAP tool.

OLAP tools are used for multidimensional analysis.

Data mining tool

Data mining tools are usually used only by very specialized users, and in an organization, even a large one, there are usually only a handful of users using data mining tools.

Data mining tools are used for finding correlation among different factors.

Business intelligence uses: Different forms of business intelligence.

Business intelligence usage can be categorized into the following categories:

1. Business operations reporting

The most common form of business intelligence is business operations reporting. This includes the actuals
and how the actuals stack up against the goals. This type of business intelligence often manifests itself in
the standard weekly or monthly reports that need to be produced.

2. Forecasting
Many of you have no doubt run into the need for forecasting, and all of you would agree that forecasting is both a science and an art. It is an art because one can never be sure what the future holds. What if competitors decide to spend a large amount of money on advertising? What if the price of oil shoots up to $80 a barrel? At the same time, it is also a science because one can extrapolate from historical data, so it's not a total guess.

3. Dashboard

The primary purpose of a dashboard is to convey the information at a glance. For this audience, there is
little, if any, need for drilling down on the data. At the same time, presentation and ease of use are very
important for a dashboard to be useful.

4. Multidimensional analysis

Multidimensional analysis is the "slicing and dicing" of the data. It offers good insight into the numbers at a
more granular level. This requires a solid data warehousing / data mart backend, as well as business-
savvy analysts to get to the necessary data.

5. Finding correlation among different factors

This is diving very deep into business intelligence. Questions asked are like, "How do different factors
correlate to one another?" and "Are there significant time trends that can be leveraged/anticipated?"

DATAWAREHOUSING CONCEPTS
Several concepts are of particular importance to data warehousing. They are discussed in detail in this
section.

• Dimensional Data Model: The dimensional data model is commonly used in data warehousing systems. This section describes this modeling technique and the two common schema types, star schema and snowflake schema.

The dimensional data model is most often used in data warehousing systems. This is different from the 3rd normal form, commonly used for transactional (OLTP) systems. As you can imagine, the same data would be stored differently in a dimensional model than in a 3rd normal form model.

To understand dimensional data modeling, let's define some of the terms commonly used in
this type of modeling:
Dimension: A category of information. For example, the time dimension.

Attribute: A unique level within a dimension. For example, Month is an attribute in the Time
Dimension.

Hierarchy: The specification of levels that represents the relationship between different attributes within a dimension. For example, one possible hierarchy in the Time dimension is Year → Quarter → Month → Day.

Fact Table: A fact table is a table that contains the measures of interest. For example, sales
amount would be such a measure. This measure is stored in the fact table with the
appropriate granularity. For example, it can be sales amount by store by day. In this case, the
fact table would contain three columns: A date column, a store column, and a sales amount
column.
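
A minimal sketch of such a fact table (column names and data types are illustrative only):

    -- Sales amount by store by day: one measure at the store/day grain.
    CREATE TABLE sales_fact (
        sales_date   DATE          NOT NULL,  -- the date column
        store_id     INT           NOT NULL,  -- the store column
        sales_amount DECIMAL(12,2) NOT NULL   -- the measure
    );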

Lookup Table: The lookup table provides the detailed information about the attributes. For example, the lookup table for the Quarter attribute would include a list of all of the quarters available in the data warehouse. Each row (each quarter) may have several fields, one for the unique ID that identifies the quarter, and one or more additional fields that specify how that particular quarter is represented on a report (for example, the first quarter of 2001 may be represented as "Q1 2001" or "2001 Q1").

A dimensional model includes fact tables and lookup tables. Fact tables connect to one or
more lookup tables, but fact tables do not have direct relationships to one another.
Dimensions and hierarchies are represented by lookup tables. Attributes are the non-key
columns in the lookup tables.

In designing data models for data warehouses / data marts, the most commonly used schema types are Star Schema and Snowflake Schema.

Whether one uses a star or a snowflake largely depends on personal preference and business needs. Personally, I am partial to snowflakes when there is a business case to analyze the information at that particular level.

Granularity

The first step in designing a fact table is to determine the granularity of the fact table. By granularity, we mean the lowest level of information that will be stored in the fact table. This constitutes two steps:

1. Determine which dimensions will be included.


2. Determine where along the hierarchy of each dimension the information will be
kept.

The determining factors usually go back to the requirements.

Which Dimensions To Include


Determining which dimensions to include is usually a straightforward process, because business processes will often dictate clearly what the relevant dimensions are.

For example, in an off-line retail world, the dimensions for a sales fact table are usually time, geography, and product. This list, however, is by no means complete for all off-line retailers. A supermarket with a rewards card program, where customers provide some personal information in exchange for a rewards card and the supermarket offers lower prices on certain items to customers who present the card at checkout, will also have the ability to track the customer dimension. Whether the data warehousing system includes the customer dimension then becomes a decision that needs to be made.

What Level Within Each Dimension To Include

Determining at which level of each dimension's hierarchy the information is stored is a bit trickier. This is where user requirements (both stated and possibly future) play a major role.

In the above example, will the supermarket want to do analysis at the hourly level (i.e., looking at how certain products sell at different hours of the day)? If so, it makes sense to use 'hour' as the lowest level of granularity in the time dimension. If daily analysis is sufficient, then 'day' can be used as the lowest level of granularity. Since the lower the level of detail, the larger the amount of data in the fact table, the granularity exercise is in essence figuring out the sweet spot in the tradeoff between detailed level of analysis and data storage.

Note that sometimes the users will not specify certain requirements, but based on industry knowledge, the data warehousing team may foresee that certain requirements will be forthcoming that result in the need for additional detail. In such cases, it is prudent for the data warehousing team to design the fact table such that lower-level information is included. This will avoid possibly needing to re-design the fact table in the future. On the other hand, trying to anticipate all future requirements is an impossible and hence futile exercise, so the data warehousing team needs to fight the urge to dump the lowest level of detail into the data warehouse and include only what is practically needed. Sometimes this can be more of an art than a science, and prior experience becomes invaluable here.

Types of Facts

There are three types of facts:

• Additive: Additive facts are facts that can be summed up through all of the
dimensions in the fact table.
• Semi-Additive: Semi-additive facts are facts that can be summed up for some
of the dimensions in the fact table, but not the others.
• Non-Additive: Non-additive facts are facts that cannot be summed up for any
of the dimensions present in the fact table.

Let us use examples to illustrate each of the three types of facts. The first example assumes
that we are a retailer, and we have a fact table with the following columns:
Date
Store
Product
Sales_Amount

The purpose of this table is to record the sales amount for each product in each store on a
daily basis. Sales_Amount is the fact. In this case, Sales_Amount is an additive fact,
because you can sum up this fact along any of the three dimensions present in the fact table
-- date, store, and product. For example, the sum of Sales_Amount for all 7 days in a week represents the total sales amount for that week.
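
A sketch of what this looks like in SQL (table and column names follow the example above, with minor adjustments):

    -- Sum across the date dimension: total sales for one week.
    SELECT SUM(sales_amount) AS weekly_sales
    FROM sales_fact
    WHERE sales_date BETWEEN '2003-01-06' AND '2003-01-12';

    -- Sum across date and store: total sales by product.
    SELECT product, SUM(sales_amount) AS product_sales
    FROM sales_fact
    GROUP BY product;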

Say we are a bank with the following fact table:

Date
Account
Current_Balance
Profit_Margin

The purpose of this table is to record the current balance for each account at the end of each
day, as well as the profit margin for each account for each
day. Current_Balance and Profit_Margin are the facts. Current_Balance is a semi-additive fact, as it makes sense to add it up across all accounts (what's the total current balance for all accounts in the bank?), but it does not make sense to add it up through time (adding up all current balances for a given account for each day of the month does not give us any useful information). Profit_Margin is a non-additive fact, for it does not make sense to add it up at either the account level or the day level.
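
A sketch of how the two facts behave under aggregation (table and column names are adapted from the example above):

    -- Semi-additive: summing Current_Balance across accounts for one day is meaningful.
    SELECT SUM(current_balance) AS total_balance
    FROM account_fact
    WHERE snapshot_date = '2003-01-15';

    -- Over time for one account, use an average (or the period-end value), not a sum.
    SELECT AVG(current_balance) AS avg_balance
    FROM account_fact
    WHERE account = 1001
      AND snapshot_date BETWEEN '2003-01-01' AND '2003-01-31';

    -- Non-additive: Profit_Margin should not be summed along any dimension;
    -- it would have to be recomputed from its underlying components instead.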

Types of Fact Tables

Based on the above classifications, there are two types of fact tables:

• Cumulative: This type of fact table describes what has happened over a period of time. For example, this fact table may describe the total sales by product by store by day. The facts for this type of fact table are mostly additive facts. The first example presented here is a cumulative fact table.
• Snapshot: This type of fact table describes the state of things at a particular instant in time, and usually includes more semi-additive and non-additive facts. The second example presented here is a snapshot fact table.

STAR SCHEMA

In the star schema design, a single object (the fact table) sits in the middle and is radially connected to other surrounding objects (dimension lookup tables) like a star. Each dimension is represented as a single table. The primary key in each dimension table is related to a foreign key in the fact table.
Sample star schema

All measures in the fact table are related to all the dimensions that the fact table is related to. In other words, they all have the same level of granularity.

A star schema can be simple or complex. A simple star consists of one fact table; a complex
star can have more than one fact table.

Let's look at an example. Assume our data warehouse keeps store sales data, and the different dimensions are time, store, product, and customer. In this case, the figure on the left represents our star schema. The lines between two tables indicate that there is a primary key / foreign key relationship between the two tables. Note that different dimensions are not related to one another.
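
A minimal sketch of this star schema in SQL (names and data types are illustrative):

    -- Dimension lookup tables, one per dimension.
    CREATE TABLE dim_time     (time_key     INT PRIMARY KEY, calendar_date DATE, month_num INT, quarter_num INT, year_num INT);
    CREATE TABLE dim_store    (store_key    INT PRIMARY KEY, store_name    VARCHAR(100), region   VARCHAR(50));
    CREATE TABLE dim_product  (product_key  INT PRIMARY KEY, product_name  VARCHAR(100), category VARCHAR(50));
    CREATE TABLE dim_customer (customer_key INT PRIMARY KEY, customer_name VARCHAR(100), state    VARCHAR(50));

    -- The fact table sits in the middle, holding one foreign key per dimension
    -- plus the measure; dimensions are not related to one another.
    CREATE TABLE sales_fact (
        time_key     INT NOT NULL REFERENCES dim_time(time_key),
        store_key    INT NOT NULL REFERENCES dim_store(store_key),
        product_key  INT NOT NULL REFERENCES dim_product(product_key),
        customer_key INT NOT NULL REFERENCES dim_customer(customer_key),
        sales_amount DECIMAL(12,2) NOT NULL
    );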

SNOWFLAKE SCHEMA

The snowflake schema is an extension of the star schema, where each point of the star explodes into more points. In a star schema, each dimension is represented by a single dimensional table, whereas in a snowflake schema, that dimensional table is normalized into multiple lookup tables, each representing a level in the dimensional hierarchy.

Sample snowflake schema

For example, consider a Time Dimension that consists of 2 different hierarchies:

1. Year → Month → Day

2. Week → Day

We will have 4 lookup tables in a snowflake schema: a lookup table for year, a lookup table for month, a lookup table for week, and a lookup table for day. Year is connected to Month, which is then connected to Day. Week is only connected to Day. A sample snowflake schema illustrating the above relationships in the Time Dimension is shown to the right.
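
A sketch of these four lookup tables in SQL (names and types are illustrative):

    CREATE TABLE lookup_year  (year_key  INT PRIMARY KEY, year_num  INT);
    CREATE TABLE lookup_month (month_key INT PRIMARY KEY, month_num INT,
                               year_key  INT REFERENCES lookup_year(year_key));   -- Year -> Month
    CREATE TABLE lookup_week  (week_key  INT PRIMARY KEY, week_num  INT);
    CREATE TABLE lookup_day   (day_key   INT PRIMARY KEY, calendar_date DATE,
                               month_key INT REFERENCES lookup_month(month_key),  -- Month -> Day
                               week_key  INT REFERENCES lookup_week(week_key));   -- Week -> Day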

The main advantage of the snowflake schema is the improvement in query performance due to minimized disk storage requirements and joining smaller lookup tables. The main disadvantage of the snowflake schema is the additional maintenance effort needed due to the increased number of lookup tables.

• Slowly Changing Dimension: This is a common issue facing data warehousing practitioners. This section explains the problem and describes the three ways of handling it, with examples.

The "Slowly Changing Dimension" problem is a common one particular to data warehousing.
In a nutshell, this applies to cases where the attribute for a record varies over time. We give
an example below:

Christina is a customer of ABC Inc. She first lived in Chicago, Illinois, so the original entry in the customer lookup table has the following record:

Customer Key    Name        State
1001            Christina   Illinois

At a later date, she moved to Los Angeles, California in January 2003. How should ABC Inc. now modify its customer table to reflect this change? This is the "Slowly Changing Dimension" problem.

There are in general three ways to solve this type of problem, and they are categorized as
follows:

Type 1: The new record replaces the original record. No trace of the old record exists.

In Type 1 Slowly Changing Dimension, the new information simply overwrites the
original information. In other words, no history is kept.

In our example, recall we originally have the following table:

Customer Key    Name        State
1001            Christina   Illinois

After Christina moved from Illinois to California, the new information replaces the original record, and we have the following table:

Customer Key    Name        State
1001            Christina   California
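
In SQL, a Type 1 change is a simple in-place update (table and column names follow the example):

    -- Overwrite the attribute; no history is kept.
    UPDATE customer_lookup
    SET    state = 'California'
    WHERE  customer_key = 1001;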

Advantages:

- This is the easiest way to handle the Slowly Changing Dimension problem, since
there is no need to keep track of the old information.

Disadvantages:

- All history is lost. By applying this methodology, it is not possible to trace back in
history. For example, in this case, the company would not be able to know that
Christina lived in Illinois before.

Usage:

About 50% of the time.

When to use Type 1:

Type 1 slowly changing dimension should be used when it is not necessary for the
data warehouse to keep track of historical changes.

Type 2: A new record is added into the customer dimension table. Therefore, the customer is
treated essentially as two people.

In Type 2 Slowly Changing Dimension, a new record is added to the table to represent the new information. Therefore, both the original and the new record will be present. The new record gets its own primary key.

In our example, recall we originally have the following table:

Customer Key    Name        State
1001            Christina   Illinois

After Christina moved from Illinois to California, we add the new information as a new
row into the table:

Customer Key    Name        State
1001            Christina   Illinois
1005            Christina   California
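
In SQL, a Type 2 change leaves the existing row untouched and inserts a new row with its own surrogate key (names follow the example):

    -- Add a new record for the new attribute value; the old record remains.
    INSERT INTO customer_lookup (customer_key, name, state)
    VALUES (1005, 'Christina', 'California');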

Advantages:

- This allows us to accurately keep all historical information.

Disadvantages:
- This will cause the size of the table to grow fast. In cases where the number of rows
for the table is very high to start with, storage and performance can become a
concern.

- This necessarily complicates the ETL process.

Usage:

About 50% of the time.

When to use Type 2:

Type 2 slowly changing dimension should be used when it is necessary for the data
warehouse to track historical changes.

Type 3: The original record is modified to reflect the change.

In Type 3 Slowly Changing Dimension, there will be two columns to indicate the
particular attribute of interest, one indicating the original value, and one indicating the
current value. There will also be a column that indicates when the current value
becomes active.

In our example, recall we originally have the following table:

Customer Key    Name        State
1001            Christina   Illinois

To accommodate Type 3 Slowly Changing Dimension, we will now have the following
columns:

• Customer Key
• Name
• Original State
• Current State
• Effective Date

After Christina moved from Illinois to California, the original information gets updated,
and we have the following table (assuming the effective date of change is January 15,
2003):

Customer Key    Name        Original State    Current State    Effective Date
1001            Christina   Illinois          California       15-JAN-2003
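
In SQL, a Type 3 change adds columns to hold the original value and the effective date, with the existing State column serving as the current value (names follow the example; ALTER syntax varies by database):

    -- One-time schema change to support Type 3.
    ALTER TABLE customer_lookup ADD COLUMN original_state VARCHAR(50);
    ALTER TABLE customer_lookup ADD COLUMN effective_date DATE;

    -- Record the change: keep the old value, overwrite the current value.
    UPDATE customer_lookup
    SET    original_state = state,
           state          = 'California',
           effective_date = '2003-01-15'
    WHERE  customer_key = 1001;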

Advantages:

- This does not increase the size of the table, since the new information is simply updated in place.
- This allows us to keep some part of history.

Disadvantages:

- Type 3 will not be able to keep all history where an attribute is changed more than
once. For example, if Christina later moves to Texas on December 15, 2003, the
California information will be lost.

Usage:

Type 3 is rarely used in actual practice.

When to use Type 3:

Type 3 slowly changing dimension should only be used when it is necessary for the data warehouse to track historical changes, and when such changes will only occur a finite number of times.

We have now taken a look at each of the scenarios, what the data model and the data look like for each of them, and how the three alternatives compare and contrast.

• Data Models

Conceptual Data Model: What is a conceptual data model, its features, and an example of
this type of data model.

Logical Data Model: What is a logical data model, its features, and an example of this type
of data model.

Physical Data Model: What is a physical data model, its features, and an example of this
type of data model.

Conceptual, Logical, and Physical Data Model: Different levels of abstraction for a data model. This section compares and contrasts the three different types of data models.

The three levels of data modeling (conceptual data model, logical data model, and physical data model) were discussed in prior sections. Here we compare these three types of data models. The table below compares the different features:

Feature                 Conceptual   Logical   Physical
Entity Names                ✓           ✓
Entity Relationships        ✓           ✓
Attributes                              ✓
Primary Keys                            ✓           ✓
Foreign Keys                            ✓           ✓
Table Names                                         ✓
Column Names                                        ✓
Column Data Types                                   ✓

Below we show the conceptual, logical, and physical versions of a single data model.

Conceptual Model Design


Logical Model Design

Physical Model Design


We can see that the complexity increases from conceptual to logical to physical. This is why
we always start with the conceptual data model (so we understand at a high level what the
different entities in our data are and how they relate to one another), then move on to the
logical data model (so we understand the details of our data without worrying about how they
will actually be implemented), and finally the physical data model (so we know exactly how to
implement our data model in the database of choice). In a data warehousing project, the
conceptual data model and the logical data model are sometimes considered a single
deliverable.
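
As an illustration of the physical level, a hypothetical Customer entity from a conceptual model might be implemented with DDL similar to the following; the table name, column names, and data types are assumptions for this sketch, not taken from any specific model above.

    CREATE TABLE customer_dim (
        customer_key   INTEGER       NOT NULL,   -- surrogate primary key
        customer_name  VARCHAR(100)  NOT NULL,
        state          CHAR(2),
        CONSTRAINT pk_customer_dim PRIMARY KEY (customer_key)
    );

Note that table names, column names, and column data types appear only at this physical level, in line with the comparison table above.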

• Data Integrity: What is data integrity and how it is enforced in data warehousing.

Data integrity refers to the validity of data, meaning data is consistent and correct. In the data
warehousing field, we frequently hear the term, "Garbage In, Garbage Out." If there is no
data integrity in the data warehouse, any resulting report and analysis will not be useful.

In a data warehouse or a data mart, there are three areas where data integrity needs to be
enforced:

Database level

We can enforce data integrity at the database level. Common ways of enforcing data integrity
include the following (a SQL sketch appears after these examples):

Referential integrity

The relationship between the primary key of one table and the foreign key of another table
must always be maintained. For example, a primary key cannot be deleted if there is still a
foreign key that refers to this primary key.

Primary key / Unique constraint

Primary keys and the UNIQUE constraint are used to make sure every row in a table can be
uniquely identified.

Not NULL vs NULL-able

Columns identified as NOT NULL may not contain a NULL value.

Valid Values

Only allowed values are permitted in the database. For example, if a column may only contain
positive integers, a value of -1 is not allowed.
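
The four mechanisms above map directly onto standard SQL constraints. A minimal sketch, assuming a hypothetical sales_fact table that references the customer dimension:

    CREATE TABLE sales_fact (
        sales_id       INTEGER        NOT NULL,  -- NOT NULL: a value must be present
        customer_key   INTEGER        NOT NULL,
        sales_amount   DECIMAL(10,2)  NOT NULL,
        quantity_sold  INTEGER        NOT NULL,
        CONSTRAINT pk_sales_fact PRIMARY KEY (sales_id),            -- unique row identification
        CONSTRAINT fk_sales_customer FOREIGN KEY (customer_key)
            REFERENCES customer_dim (customer_key),                 -- referential integrity
        CONSTRAINT ck_positive_quantity CHECK (quantity_sold > 0)   -- valid values only
    );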

ETL process

For each step of the ETL process, data integrity checks should be put in place to ensure that
source data is the same as the data in the destination. Most common checks include record
counts or record sums.
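
For example, a simple reconciliation query can compare row counts and totals between a source table and its destination in the warehouse; the table and column names here are hypothetical.

    -- Record count and sum of a key measure in the source system
    SELECT COUNT(*) AS record_count, SUM(order_amount) AS total_amount
      FROM source_orders;

    -- The same figures after loading; the two result sets should match
    SELECT COUNT(*) AS record_count, SUM(order_amount) AS total_amount
      FROM warehouse_orders;
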
Access level

We need to ensure that data is not altered by any unauthorized means either during the ETL
process or in the data warehouse. To do this, there needs to be safeguards against
unauthorized access to data (including physical access to the servers), as well as logging of
all data access history. Data integrity can only be ensured if there is no unauthorized access to
the data.

• What is OLAP: Definition of OLAP.

OLAP stands for On-Line Analytical Processing. The first attempt to provide a definition of
OLAP was made by Dr. Codd, who proposed 12 rules for OLAP. Later, it was discovered that this
particular white paper was sponsored by one of the OLAP tool vendors, thus causing it to
lose objectivity. The OLAP Report has proposed the FASMI test: Fast Analysis
of Shared Multidimensional Information. For a more detailed description of both Dr. Codd's
rules and the FASMI test, please visit The OLAP Report.

For people on the business side, the key feature out of the above is "Multidimensional." In
other words, it is the ability to analyze metrics in different dimensions such as time, geography,
gender, product, etc. For example, sales for the company are up. What region is most
responsible for this increase? Which store in this region is most responsible for the increase?
What particular product category or categories contributed the most to the increase?
Answering these types of questions in order means that you are performing an OLAP
analysis.
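
As a sketch of how such questions translate into queries against a star schema, consider the following; the sales_fact and store_dim tables, and the example region value, are assumptions for illustration.

    -- Which region is most responsible for the increase?
    SELECT s.region, SUM(f.sales_amount) AS total_sales
      FROM sales_fact f
      JOIN store_dim s ON f.store_key = s.store_key
     GROUP BY s.region
     ORDER BY total_sales DESC;

    -- Drill down: which store within the leading region?
    SELECT s.store_name, SUM(f.sales_amount) AS total_sales
      FROM sales_fact f
      JOIN store_dim s ON f.store_key = s.store_key
     WHERE s.region = 'Midwest'   -- hypothetical leading region from the previous query
     GROUP BY s.store_name
     ORDER BY total_sales DESC;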

Depending on the underlying technology used, OLAP can be broadly divided into two
different camps: MOLAP and ROLAP. A discussion of the different OLAP types can be found
in the MOLAP, ROLAP, and HOLAP section.

• MOLAP, ROLAP, and HOLAP: What are these different types of OLAP technology? This section
discusses how they differ from one another, and the advantages and disadvantages of each.

In the OLAP world, there are mainly two different types: Multidimensional OLAP (MOLAP)
and Relational OLAP (ROLAP). Hybrid OLAP (HOLAP) refers to technologies that combine
MOLAP and ROLAP.

MOLAP

This is the more traditional way of OLAP analysis. In MOLAP, data is stored in a
multidimensional cube. The storage is not in the relational database, but in proprietary formats.

Advantages:
• Excellent performance: MOLAP cubes are built for fast data retrieval, and are optimal for
slicing and dicing operations.
• Can perform complex calculations: All calculations have been pre-generated when the
cube is created. Hence, complex calculations are not only doable, but they return quickly.

Disadvantages:

• Limited in the amount of data it can handle: Because all calculations are performed
when the cube is built, it is not possible to include a large amount of data in the cube
itself. This is not to say that the data in the cube cannot be derived from a large amount of
data. Indeed, this is possible. But in this case, only summary-level information will be
included in the cube itself.
• Requires additional investment: Cube technology is often proprietary and does not
already exist in the organization. Therefore, to adopt MOLAP technology, chances are that
additional investments in human and capital resources are needed.

ROLAP

This methodology relies on manipulating the data stored in the relational database to give the
appearance of traditional OLAP's slicing and dicing functionality. In essence, each action of
slicing and dicing is equivalent to adding a "WHERE" clause in the SQL statement.
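
For instance, slicing a sales query down to a single year may add nothing more than an extra condition; the star-schema tables and the calendar_year column below are assumptions for this sketch.

    -- Sales by product category across all time
    SELECT p.category, SUM(f.sales_amount) AS total_sales
      FROM sales_fact f
      JOIN product_dim p ON f.product_key = p.product_key
     GROUP BY p.category;

    -- The same query "sliced" on the time dimension by adding a WHERE clause
    SELECT p.category, SUM(f.sales_amount) AS total_sales
      FROM sales_fact f
      JOIN product_dim p ON f.product_key = p.product_key
      JOIN time_dim t ON f.time_key = t.time_key
     WHERE t.calendar_year = 2003
     GROUP BY p.category;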

Advantages:

• Can handle large amounts of data: The data size limitation of ROLAP technology is
the limitation on data size of the underlying relational database. In other words, ROLAP
itself places no limitation on data amount.
• Can leverage functionalities inherent in the relational database: Often, the relational
database already comes with a host of functionalities. ROLAP technologies, since they sit
on top of the relational database, can leverage these functionalities.

Disadvantages:

• Performance can be slow: Because each ROLAP report is essentially a SQL query (or
multiple SQL queries) in the relational database, the query time can be long if the
underlying data size is large.
• Limited by SQL functionalities: ROLAP technology mainly relies on generating SQL
statements to query the relational database, and SQL statements do not fit all needs (for
example, it is difficult to perform complex calculations using SQL), so ROLAP
technologies are traditionally limited by what SQL can do. ROLAP vendors have
mitigated this risk by building out-of-the-box complex functions into the tool, as well as
by allowing users to define their own functions.

HOLAP

HOLAP technologies attempt to combine the advantages of MOLAP and ROLAP. For summary-
type information, HOLAP leverages cube technology for faster performance. When detail
information is needed, HOLAP can "drill through" from the cube into the underlying relational
data.

• Bill Inmon vs. Ralph Kimball: These two data warehousing heavyweights have different views of
the relationship between the data warehouse and the data mart.

In the data warehousing field, we often hear discussions about whether a person's or an
organization's philosophy falls into Bill Inmon's camp or into Ralph Kimball's camp. We
describe below the difference between the two.

Bill Inmon's paradigm: Data warehouse is one part of the overall business intelligence
system. An enterprise has one data warehouse, and data marts source their information from
the data warehouse. In the data warehouse, information is stored in 3rd normal form.

Ralph Kimball's paradigm: Data warehouse is the conglomerate of all data marts within the
enterprise. Information is always stored in the dimensional model.

There is no right or wrong between these two ideas, as they represent different data
warehousing philosophies. In reality, the data warehouse in most enterprises is closer to
Ralph Kimball's idea. This is because most data warehouses started out as a departmental
effort, and hence they originated as data marts. Only when more data marts are built later do
they evolve into a data warehouse.

GLOSSARY
Aggregation: One way of speeding up query performance. Facts are summed up for selected dimensions
from the original fact table. The resulting aggregate table will have fewer rows, thus making queries that can
use it run faster.

Attribute: Attributes represent a single type of information in a dimension. For example, year is an attribute
in the Time dimension.

Conformed Dimension: A dimension that has exactly the same meaning and content when being referred
from different fact tables.

Data Mart: Data marts have the same definition as the data warehouse (see below), but data marts have a
more limited audience and/or data content.

Data Warehouse: A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection
of data in support of management's decision making process (as defined by Bill Inmon).

Data Warehousing: The process of designing, building, and maintaining a data warehouse system.

Dimension: The same category of information. For example, year, month, day, and week are all part of the
Time Dimension.

Dimensional Model: A type of data modeling suited for data warehousing. In a dimensional model, there
are two types of tables: dimensional tables and fact tables. Dimensional tables record information on each
dimension, and fact tables record the facts, or measures.

Dimensional Table: A dimensional table stores records related to a particular dimension. No facts are
stored in a dimensional table.

Drill Across: Data analysis across dimensions.

Drill Down: Data analysis to a child attribute.

Drill Through: Data analysis that goes from an OLAP cube into the relational database.

Drill Up: Data analysis to a parent attribute.

ETL: Stands for Extraction, Transformation, and Loading. The movement of data from one area to another.

Fact Table: A type of table in the dimensional model. A fact table typically includes two types of columns:
fact columns and foreign keys to the dimensions.

Hierarchy: A hierarchy defines the navigating path for drilling up and drilling down. All attributes in a
hierarchy belong to the same dimension.

Metadata: Data about data. For example, the number of tables in the database is a type of metadata.

Metric: A measured value. For example, total sales is a metric.

MOLAP: Multidimensional OLAP. MOLAP systems store data in multidimensional cubes.

OLAP: On-Line Analytical Processing. OLAP should be designed to provide end users a quick way of
slicing and dicing the data.

ROLAP: Relational OLAP. ROLAP systems store data in the relational database.

Snowflake Schema: A common form of dimensional model. In a snowflake schema, different hierarchies
in a dimension can be extended into their own dimensional tables. Therefore, a dimension can have more
than a single dimension table.

Star Schema: A common form of dimensional model. In a star schema, each dimension is represented by
a single dimension table.
