Sie sind auf Seite 1von 17

Getting Started with Learning About Data Warehousing

A Definition of Data Warehousing


A Definition of Decision Support
The Case for Data Warehousing
The Case Against Data Warehousing
Actions for Data Warehouse Success
Data Warehousing Gotchas
Performing Data Warehousing Software Evaluations
An (Informal) Taxonomy of Data Warehouse Data Errors
Data Warehousing Political Issues
Different Aspects of Data Warehouse Architecture
What to Learn About in Order to Speed Up Data Warehouse
Querying
What to Learn About in Order to Speed Up Data Warehouse
Loading
How to Save Money on Your Data Warehousing Efforts
Using Data Warehousing in Strategic Decision Making
Maintenance Issues for Data Warehousing Systems
What Decision Support Tools are Used For
Is Web Data Analysis (i.e., Web Data Mining) Different?

A data warehouse is a copy of transaction data


specifically structured for querying and
reporting.

The Case for Data Warehousing


The following is a list of the basic reasons why organizations
implement data warehousing. This list was put together because too
much of the data warehousing literature confuses "next order" benefits
with these basic reasons. For example, spend a little time reading data
warehouse trade material and you will read about using a data
warehouse to "convert data into business intelligence", "make
management decision making based on facts not intuition", "get closer
to the customers", and the seemingly ubiquitously used phrase "gain
competitive advantage". In probably 99% of the data warehousing
implementations, data warehousing is only one step out of many in the
long road toward the ultimate goal of accomplishing these highfalutin
objectives.
The basic reasons organizations implement data warehouses are:

To perform server/disk bound tasks associated with querying


and reporting on servers/disks not used by transaction
processing systems

Most firms want to set up transaction processing systems so there is a


high probability that transactions will be completed in what is judged
to be an acceptable amount of time. Reports and queries, which can
require a much greater range of limited server/disk resources than
transaction processing, run on the servers/disks used by transaction
processing systems can lower the probability that transactions
complete in an acceptable amount of time. Or, running queries and
reports, with their variable resource requirements, on the servers/disks
used by transaction processing systems can make it quite complex to
manage servers/disks so there is a high enough probability that
acceptable response time can be achieved. Firms therefore may find
that the least expensive and/or most organizationally expeditious way
to obtain high probability of acceptable transaction processing
response time is to implement a data warehousing architecture that
uses separate servers/disks for some querying and reporting.

To use data models and/or server technologies that speed up


querying and reporting and that are not appropriate for
transaction processing

There are ways of modeling data that usually speed up querying and
reporting (e.g., a star schema) and may not be appropriate for
transaction processing because the modeling technique will slow down
and complicate transaction processing. Also, there are server
technologies that that may speed up query and reporting processing
but may slow down transaction processing (e.g., bit-mapped indexing)
and server technologies that may speed up transaction processing but
slow down query and report processing (e.g., technology for
transaction recovery.) - Do note that whether and by how much a
modeling technique or server technology is a help or hindrance to
querying/reporting and transaction processing varies across vendors'
products and according to the situation in which the technique or
technology is used.
To provide an environment where a relatively small amount of
knowledge of the technical aspects of database
technology is required to write and maintain queries and
reports and/or to provide a means to speed up the writing
and maintaining of queries and reports by technical
personnel

Often a data warehouse can be set up so that simpler queries and


reports can be written by less technically knowledgeable personnel.
Nevertheless, less technically knowledgeable personnel often "hit a
complexity wall" and need IS help. IS, however, may also be able to
more quickly write and maintain queries and reports written against
data warehouse data. It should be noted, however, that much of the
improved IS productivity probably comes from the lack of bureaucracy
usually associated with establishing reports and queries in the data
warehouse.

To provide a repository of "cleaned up" transaction processing


systems data that can be reported against and that does
not necessarily require fixing the transaction processing
systems

Please read my essay on An informal taxonomy of data


warehouse data errors for an explanation of the type of "errors" that
need cleaning up. The data warehouse provides an opportunity to
clean up the data without changing the transaction processing
systems. Note, however, that some data warehousing implementations
provide a means to capture corrections made to the data warehouse
data and feed the corrections back into transaction processing
systems. Sometimes it makes more sense to handle corrections this
way than to apply changes directly to the transaction processing
system.

To make it easier, on a regular basis, to query and report data


from multiple transaction processing systems and/or from
external data sources and/or from data that must be
stored for query/report purposes only

For a long time firms that need reports with data from multiple
systems have been writing data extracts and then running sort/merge
logic to combine the extracted data and then running reports against
the sort/merged data. In many cases this is a perfectly adequate
strategy. However, if a company has large amounts of data that need
to be sort/merged frequently, if data purged from transaction
processing systems needs to be reported upon, and most importantly,
if the data need to be "cleaned", data warehousing may be
appropriate.

To provide a repository of transaction processing system data


that contains data from a longer span of time than can
efficiently be held in a transaction processing system
and/or to be able to generate reports "as was" as of a
previous point in time

Older data are often purged from transaction processing systems so


the expected response time can be better controlled. For querying and
reporting, this purged data and the current data may be stored in the
data warehouse where there presumably is less of a need to control
expected response time or the expected response time is at a much
higher level. - As for "as was" reporting, some times it is difficult, if not
impossible, to generate a report based on some characteristic at a
previous point in time. For example, if you want a report of the salaries
of employees at grade Level 3 as of the beginning of each month in
1997, you may not be able to do this because you only have a record
of current employee grade level. To be able to handle this type of
reporting problem, firms may implement data warehouses that handle
what is called the "slowly changing dimension" issue.

To prevent persons who only need to query and report


transaction processing system data from having any
access whatsoever to transaction processing system
databases and logic used to maintain those databases

The concern here is security. For example, data warehousing may be


interesting to firms that want to allow report and querying only over
the Internet.
Some firms implement data warehousing for all the reasons cited.
Some firm implement data warehousing for only one of the reasons
cited.
By the way, I am not saying that a data warehouse has no "business"
objectives. (I grit my teeth when I say that because I am not one to
assume that an IT objective is not a business objective. We IT people
are businesspeople too.) I do believe that the achievement of a
"business" objective for a data warehouse necessarily comes about
because of the achievement of one or many of the above objectives.
If you examine the list you may be struck that need for data
warehousing is mainly caused by the limitations of transaction
processing systems. These limitations of transaction processing
systems are not, however, inherent. That is, the limitations will not be
in every implementation of a transaction processing system. Also, the
limitations of transaction processing systems will vary in how crippling
they are.
Finally, to repeat the point I made initially, a firm that expects to get
business intelligence, better decision making, closeness to its
customers, and competitive advantage simply by plopping down a data
warehouse is in for a surprise. Obtaining these next order benefits
requires firms to figure out, usually by trial and error, how to change
business practices to best use the data warehouse and then to change
their business practices. And that can be harder than implementing a
data warehouse.

A Definition of Decision Support


The term decision support, if my knowledge of history of this area is
correct, goes back to the 1970s when it was coined by some
academics associated with the Massachusetts Institute of Technology.
Since then, many academic definitions have been offered. - My
purpose in this essay is to provide a definition that may lend clarity to
practitioners.

A decision support system or tool is one specifically designed


to allow business end users to perform computer
generated analyses of data on their own.

I believe the essence of decision support is, in the language of the


1960s, to allow end users to do their own thing. I note that this
definition is still fuzzy because what constitutes analyses and "on their
own" are debatable points.

We cannot say that decision support systems or tools


necessarily support the making of decisions.

What's in a name? - As far as I know, cognitive researchers do not


agree on how decisions are made. Therefore, saying that these tools
support making decisions is not a provable statement. Nor, is it, in may
opinion, an insightful way of defining these tools.
These tools do not analyze by themselves - rather they help a
person analyze

In other words, the tools facilitate analyses rather than perform


analyses. If you want to to learn more about how the tools facilitate
analyses, see my essay on What Decision Support Tools are Used
For.

Data warehousing and decision support systems and tools do


not necessarily go hand in hand.

Many data warehouses are not used as decision support systems. And
decision support systems or tools do not necessarily require the use of
a data warehouse as a source for data. I assert that, by far, the most
used decision support tools are spreadsheets not connected in any
automated way with a data warehouse.

Business intelligence seems to have become the vendors'


preferred synonym for decision support

My guess is because decision support has an academic connotation


and, as just mentioned, decision support systems do not necessarily
support decisions. On the other hand, business intelligence systems do
not necessarily make a business more intelligent. By the way, the
consultant-coined term business intelligence goes back to the late
1980s, fell out of use, and then was revived by the DW/DSS world in
the late 1990s. Confusingly, business intelligence is also used as a
synonym for competitive intelligence (and is probably a more apt term
for that area). By the way, "analytics" seems to be an up and coming
name for this area - despite the mid-1990 consultant-coined term
"analytical applications" never taking hold.

What Decision Support Tools are Used For


In the section on the "dirty little secrets of data warehousing" in her fascinating
book "e-Data", Jill Dyché notes many IT departments don't really know how the
business is using its data warehouse. It is not necessarily bad, though, if IT
does not know all the specific uses. Sometimes the sign of a great warehouse is
that the users "run with it" on their own.
Nevertheless, it is possible to get a general idea just what the decision support
(a.k.a., business intelligence) tools used to access a data warehouse are being
used for. In this essay, I will attempt to make a general statement about use of
these tools. Perhaps data warehouse support people can do a better job if they
have a better feel for what the tools are really being used for.
The main uses of decision support tools are:

To check that "everything" is okay

Surprise! Nothing will be done with many, perhaps most, of the queries and
reports created with decision support tools. They are run to confirm a person's
usually not crisply defined notion but intuitively felt notion of "okayness". If I
were able to write the essay on "The Zen of Data Warehousing" (which I will
not), I would say a primary function of decision support tools is to support non-
action.

To confirm the "obvious"

Most end users the reports and queries are ultimately being produced for have
a pretty good gut feel for what is going on in their area of concern. Decision
support tools do not tell these people anything amazing that the people don't
already suspect. But the information produced with the tools gives them
confidence their gut feel is okay.

To figure out how something "works"

Most people are not looking for some grand Unified Theory of how firm XYZ
works. Rather, they want to understand some small aspect of an operation like
Customer A always pays on time, Customer B usually pays late and still takes
the early payment discount, etc.

To convey information in a more digestible manner

These tools are often used to convey what a person or persons already know.
These knowing people use the tools simply to present information to other
people in a way that it is more easily read.

To compare information about customers, products, cost/profit centers,


financial accounts

Sometimes this is side by side comparisons of a series of measures. Sometimes


this is identification of the most, the least, the earliest, the latest, etc.

To compare the same type of information in different time periods


This is simply the usual daily, weekly, monthly, quarterly, yearly comparisons.

To check performance versus formal and informal goals or constraints

That is, measures of what actually occurred are compared with budgets,
forecasts, quotas, or some other types of goals.

To identify the out of the ordinary

Usually the ultimate consumer of the tool's output has somewhat vague criteria
of what is out of the ordinary. The decision support tools kind of do double duty
in that they help refine the criteria of what is out of the ordinary and identify
what fit the refined criteria of out of ordinariness.

To grab a little piece of information out of a large volume of


information

These tools make picking that virtual needle out of that virtual haystack a lot
simpler.

To get around an Information Technology department that does not


have the time or the resources to write reports

Often end users use these tools out of impatience with the IT department. Or,
the IT department gives the user these tools to relieve the pressure off of itself.
The end users in these cases often write reports that could hardly be called
analyses.

To provide a report "of record"

For all kinds of reasons it is often necessary for people to agree that "these are
the numbers". Note they do not have to agree on all the data - just some data
whose credibility must be accepted for actions to be taken. Decision support
tools often are used to produce this "official" information.

To confirm and sometimes to discover trends and relationships

With all respect to the people working hard on data mining, I think that most
good businesspeople have an intuitive feeling of the most important trends and
relationships between factors that are affecting their business. The decision
support tools perform the function of confirming their intuition. Yes, the tools
also can help discover trends and relationships but it is difficult (though
potentially profitable) to sift out the meaningless and spurious trends.

To help advocate a position

These tools are not just for "objective" presentation of the facts. Often they are
cleverly used to help bolster the case for doing (or not doing) something.

To provide data for a what if analysis or a forecast

That is, the tools are used to feed data into a spreadsheet where the actual
what-if analysis or forecast will be done. The tools can do some of the what-if-
ing and forecasting themselves but most business users are more comfortable
doing this work in spreadsheets.

To repeat points I have made in other essays, despite their name most of these
tools are not used as the sole input into making a non-trivial decision. Nor do
they directly supply what I would consider to be business intelligence. Decisions
are made and business intelligence is garnered only with the combination of the
output of the decision support tools, human judgment and intuition, and the
ability to put the information spit out by tools into a context of information that
is much wider than any data warehouse, transaction processing system,
knowledge repository can handle.

Actions for Data Warehouse Success


The following are some suggestions for the warehouse builder. These are points
I rarely see discussed or I do not see discussed enough in the barrage of articles
about data warehousing.

From day one establish that warehousing is a joint user/builder project

Warehouse projects will fail if the builders get specs from the users, go off for 6
months, and then come back with the 'finished' project. Warehouses are
iterative! (I think the word iterative means there are lots of mistakes in the
projects.) Builders and users working with each other will not reduce the
number of iterations, but it will reduce the size of them. By the way, see Peter
Block's Flawless Consulting for a great discussion of how to bring about 'joint'
projects.
Establish that maintaining data quality will be an ONGOING joint
user/builder responsibility

Organizations undertaking warehousing efforts almost continually discover data


problems. Best to establish right up front that this project is going to entail
some additional ongoing responsibility.

Train the users one step at a time

Typically users are trained once. In several days they learn both the basics and
intermediate and sometimes advanced aspects of using a tool. Slow down!
Consider providing training initially in the minimum needed for the user to get
something useful from the tool. Then let the user use the tool for a while
(meaning several days, weeks, or months). Having basic training and some
hands on experience, the user will have a much better context with which to
grasp the next level. Also, once the basics and the next level are learned, keep
training the users! After a year using the tool, schedule advanced training.

Train the users about the data stored in the data warehouse

Users often need more training about the stored data than about the tools used
to access the data. Do not assume the data are self-explanatory or that any
metadata you may provide will answer any questions. Note that users are often
used to seeing data in canned reports and seeing data in its "raw" form can be
confusing.

Consider doing a high level corporate data model / data warehouse


architecture "exercise" in three weeks

Actually, the key point regarding time is to "time-box" the exercise into a
relatively short time. After about three weeks, the marginal benefits from
additional time devoted to these types of exercises rapidly decrease. - The
corporate model is going to identify, at a high level, subjects and relationships
and most importantly, what are the chunks of information that it makes sense
to deliver in different projects. The architecture part of the exercise to
determine the dimensions, definitions of derived data, attribute names, and
information sources that you will attempt to use consistently in your data
warehousing efforts. The exercise also consists of coming to an agreement as to
how to keep the corporate model up-to-date and how to make sure future data
warehousing efforts pay attention to the architectural principles.

Implement a user accessible automated directory to information stored


in the warehouse
The majority of successful warehousing efforts I have seen included providing
some means for the warehouse user to locate stored information. Most of the
times this involved building a separate database with directory information.
And most of the time, a pretty simple database sufficed for initial use.

Once you know what raw data you want to feed into the data, request
that data

If you have done some reading on data warehouse development you probably
have read that figuring out the process of extracting, transforming, and loading
(ETL) usually takes the majority of the time in initial data warehouse
development. In project management lingo, figuring out ETL is usually on the
critical path. - If you know what raw data you need, request it as soon as you
know it. You are probably going to have to ask one of the programmers of the
legacy feeder systems to initially get this data for you. For reasons of politics,
overwork, and just plain lack of knowledge of how data are physically stored in
a system, the feeder system programmer often can take a while to get you that
data.

Determine a plan to test the integrity of the data in the warehouse

Do not underestimate the importance of user faith in the integrity of the


warehouse data. Huge warehouse efforts quickly go sour if after system roll-out
users find multiple mistakes. A good investment of time in the initial stages of a
warehouse project is for the builder and user to jointly determine what checks
will be made on the warehouse data during development and what checks need
to be made on an ongoing basis. The checks including tying warehouse data
controls back to controls in feeder systems, checking the correctness of
aggregation logic, testing whether classifications codes were assigned correctly.

From the start get warehouse users in the habit of 'testing' complex
queries

Many people will assume that the query result is correct. At the very least, get
the user in the habit of eyeballing the query or report to check if several records
that should be included are, in fact, included and that several records that
should not be included are, in fact, not included.

Coordinate system roll-out with network administration personnel

Use of data warehousing systems can bring about some strange spikes in
network activity. If you keep network administration people informed of the roll-
out schedule, chances are they will monitor network activity for you and be
ready to make adjustments to the network as necessary.
Have a good grasp of desktop databases and spreadsheets

Even if you are dealing with a 100 TB database, there are so many little tasks to
be done in a data warehousing project where knowledge of these tools will be
helpful. Skillful use of these tools during development can be a huge
productivity enhancer.

Be prepared to support beginning users immediately and at any time

We developers often greatly underestimate users' hesitation to begin using the


data warehouse. This hesitation could be because of user fear of technology or
user fear that they will not get IS support. So, the first point is to be available to
help when the user wants to try to use the data warehouse the first time. Users
also may want to use the data warehouse for the first time during the weekend
or at 6:00 in the morning or 8:00 at night. The distractions are less at those
times. If you want to make that beginning user as a committed customer of
your data warehouse, you better be available to support the user when he
starts out whatever the day or the hour.

Maintain the audit trail to the feeder systems

That is, make it as easy as possible to tie the data in the data warehouse to the
feeder systems. Your users have to trust the numbers in the data warehouse.
You owe this to the users in order to maintain their trust.

Market and sell your data warehousing systems

For the most part, use of data warehousing systems is optional. This means you
have to identify the potential users of the systems, help them understand what
are the benefits of the system, and then make them want to keep coming back
to use the system.

Data Warehousing Gotchas


Here are some points for the warehouse builder I rarely see discussed
or I do not see discussed enough in the barrage of articles about data
warehousing. Forewarned is forearmed!

You are going to spend much time extracting, cleaning, and


loading data
The usual figure quoted is that 80% of the time building a data
warehouse will be spent on this type of work. (No one has ever
explained how this percentage was obtained though.) Suffice it to say,
though, the amount of time on these tasks is often grossly
underestimated. Note that this point is about extracting and cleaning
and loading. Though by now many people are aware the cleaning the
data is complex, extracting data and loading data are equally, if not
more, complex.

Despite best efforts at project management, data warehousing


project scope will increase

To paraphrase data warehousing author W. H. Inmon, traditional


projects start with requirements and end with data. Data warehousing
projects start with data and end with requirements. Once warehouse
users see what they can do with 2000's technology, they will want
much more. (Which is fine!) One piece of advice for the warehouse
builder is never to ask the warehouse user what information he wants.
Rather, ask what information he wants next.

You are going to find problems with systems feeding the data
warehouse

Problems that have gone undetected for years will pop up. You are
going to have to make a decision on whether to fix the problem in what
you thought was the 'read-only' data warehouse or fix the transaction
processing system.

You will find the need to store data not being captured by any
existing system

A very common problem is to find the need to store data that are not
kept in any transaction processing system. For example, when building
sales reporting data warehouses, there is often a need to include
information on off-invoice adjustments not recorded in an order entry
system. In this case the data warehouse developer faces the possibility
of modifying the transaction processing system or building a system
dedicated to capturing the missing information.

You will need to validate data not being validated by


transaction processing systems

Typically once data are in warehouse many inconsistencies are found


with fields containing 'descriptive' information. For example, many
times no controls are put on customer names. Therefore, you could
have 'DEC', 'Digital' and, 'Digital Equipment' in your database. This is
going to cause problems for a warehouse user who expects to perform
an ad hoc query selecting on customer name. The warehouse
developer, again, may have to modify the transaction processing
systems or develop (or buy) some data scrubbing technology.

Some transaction processing systems feeding the warehousing


system will not contain detail

This problem is often encountered in customer or product oriented


warehousing systems. Often it is found that a system which contains
information that the designer would like to feed into the warehousing
system does not contain information down to the product or customer
level. By the way, this is what some people label a 'granularity'
problem.

You will underbudget for the resources skilled in the feeder


system platforms

In addition to understanding the feeder system data, you may find it


advantageous to build some of the "cleaning" logic on the feeder
system platform if that platform is a mainframe. Often cleaning
involves a great deal of sort/merging - tasks at which mainframe
utilities often excel. Also, you may find that you want to build
aggregates on the mainframe because aggregation also involves
substantial sorting.

Many warehouse end users will be trained and never or seldom


apply their training

I once read a study that claimed that only one quarter of the people
who get training in a query tool actually become heavy users of the
tool.

After end users receive query and report tools, requests for IS
written reports may increase

This phenomenon was seen with many of the information centers of


the 1980s. It comes about because the query and report tools allow
the user the users to gain a much better appreciation of what
technology could do. However, for many reasons the users are unable
to use the new tools themselves to realize the potential. By the way, if
this happens do some honest research on why. Granted there are
many reports that are so complex that IS expertise is going to be
required no matter what tool the end user has. However, many times
this phenomenon points to training needs.

Your warehouse users will develop conflicting business rules

Many warehouse tools allow users to perform calculations. The tools


will allow users to perform the same calculation differently. For
instance, suppose you are summarizing beverage sales by flavor
category. Also suppose that the flavor category includes cherry and
cola. If you have a cherry cola brand there is a chance that two users
will classify the brand in different categories. You will find that there
are means to incorporate some of the business rules in your
warehouse. However, the number of possible business rules is so large
that you will not be able to incorporate all rules.

Your warehouse users may not know how to use data

After many years of using whatever reports have been thrown in their
faces, the users may not know what data to use their newfangled
decision support tools to retrieve. To use a phrase from pop sociology,
the users have been "culturally conditioned" to use what they are
given and to never ask for more.

Large scale data warehousing can become an exercise in data


homogenizing

Data have quirks! Sometimes when we developers combine detailed


data for different subjects, in our efforts to make everything 'fit' we
can take the life out of the data. For instance, if your company sells
dog food and auto tires, you want to be careful if you are building a
sales data warehouse for both lines of business. You have to make a
judgment call as to whether these businesses fit the same logical
and/or physical model.

'Overhead' can eat up great amounts of disk space

A popular way to design a decision support relational databases is with


star or snowflake schemas. Persons taking this approach usually also
build aggregate fact tables. If there are many dimensions to the data,
be aware that the combination of the aggregate tables and indexes to
the fact tables and aggregate fact tables can eat up many times more
space than the raw data. If you are using multidimensional databases,
be aware that certain products pre-calculate and store summarized
data. As with star/snowflake schemas, storage of this calculated data
can eat up far more storage than the raw data.
The time it takes to load the warehouse will expand to the
amount of the time in the available window... and then
some

You'll do yourself well by understanding the different ways to approach


updating the warehouse. Before you decide that you can do complete
refreshes, be aware that "There's all day Sunday to load the
database!" have been famous last words of more than a handful of
warehouse developers.

You are going to have a tough problem with security -


especially if you make your data warehouse Web-
accessible

You are going to face a paradox - the more accessible you make your
data warehouse (and by accessible, I don't just mean making it Web
accessible - I mean architecting it in a way that people want to use it),
the greater security risk you are exposing yourself too. Frankly,
restricting people to "need to know" does not cut it in the organization
on the 2000s. But, on the other hand, exposing information to theft
from anyplace in the globe is not too great for job security either.

The data warehouse data you do not reconcile with the feeder
systems will cause the problems

For certain data warehouse data you are going to think that there is no
logical way that data in the feeder systems can be reconciled with
what are in the warehouse. Then, when a user looks at a report and
tells you "I think there is a problem", it will be with the unreconciled
data. Unfortunately, you will then discover there is a way, albeit
roundabout, to reconcile the data.

You are building a HIGH maintenance system

Reorganizations, product introductions, new pricing schemes, new


customers, changes in production systems, etc. are going to affect the
warehouse. If the warehouse is going to stay 'current' (and being
current will be a big selling point of the warehouse), changes to the
warehouse have to be made fast.

You will fail if you concentrate on resource optimization to the


neglect of project, data, and customer management
issues and an understanding of what adds value to the
customer
If you provide a system that is fast and technically elegant but adds
little value or has suspect data, you will probably lose your customer
from day one and will have a tough time getting him back. For the
most part, use of data warehousing systems is optional. The customer
has to want to use the system.

Das könnte Ihnen auch gefallen