
Data Warehousing

April Reeve, in Managing Data in Motion, 2013


Layers in an enterprise data warehouse architecture
Data entering and leaving the data warehouse passes, by means of extract, transform, and load (ETL) processes, through the logical structural layers of the architecture, which are connected using data integration technologies. As depicted in Figure 7.1, the data flows from left to right: from the source systems into the data warehouse and then on to the business intelligence layer. In many organizations, the enterprise data warehouse is the primary user of data integration and may have sophisticated vendor data integration tools acquired specifically to support the data warehousing requirements. Data integration provides the flow of data into, between, and out of the various layers of the data warehouse architecture.
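This flow can be pictured as a small pipeline per source. The sketch below is a minimal Python illustration, assuming an invented source extract file, invented column names, and an invented warehouse table; the architecture itself does not prescribe any particular implementation or tool.

```python
import csv
import sqlite3
from pathlib import Path

# Hypothetical sketch of the left-to-right flow: extract rows from a source
# system file, transform them into the warehouse's conformed structure, and
# load them into a warehouse table. File, table, and column names are invented.

def extract(source_csv: Path) -> list[dict]:
    """Pull the raw rows from the operational application's extract file."""
    with source_csv.open(newline="") as f:
        return list(csv.DictReader(f))

def transform(rows: list[dict]) -> list[tuple]:
    """Map source-specific fields onto the warehouse's conformed structure."""
    return [
        (r["order_id"], r["cust_no"], float(r["amt"]), r["order_dt"])
        for r in rows
    ]

def load(rows: list[tuple], warehouse: sqlite3.Connection) -> None:
    """Append the conformed rows to a warehouse fact table."""
    warehouse.executemany(
        "INSERT INTO fact_orders (order_id, customer_id, amount, order_date) "
        "VALUES (?, ?, ?, ?)",
        rows,
    )
    warehouse.commit()
```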



Figure 7.1.  Data Warehouse Data Flow.

Operational application layer


The operational application layer consists of the various sources of data to be fed into the
data warehouse from the applications that perform the primary operational functions of the
organization. This layer is where the portfolio of core application systems for the
organization resides. Not all reporting necessarily moves to the data warehouse: operational reporting on the processing within a particular application may remain in that application, because its concerns are specific to the functionality and needs of that application's users.
External data
Some data for the data warehouse may come from outside the organization. Detail data may be supplied to the warehouse by the organization's customers, suppliers, or other partners. Standard codes, valid values, and other reference data may be provided by government sources, industry organizations, or business exchanges. Additionally, many data warehouses enhance the data available in the organization with purchased data concerning consumers or customers.
External data must pass through additional security access layers for the network and
organization, protecting the organization from harmful data and attacks.
External data should be expected to be less likely to conform to its expected structure and content, since communication and agreement between separate organizations is usually harder than communication within a single organization. Profiling and quality monitoring of data acquired from external sources is very important, possibly even more critical than monitoring data from internal sources. Integration with external data should therefore be kept loosely coupled, with the expectation that its format and content may change.
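As an illustration of that defensive posture, the sketch below profiles an incoming external feed for structural and content drift before it is loaded. The expected columns and the null-rate threshold are assumptions standing in for whatever the interface agreement actually specifies.

```python
import csv
from collections import Counter

EXPECTED_COLUMNS = {"customer_id", "country_code", "segment"}  # assumed contract
MAX_NULL_RATE = 0.02                                           # assumed tolerance

def profile_external_feed(path: str) -> dict:
    """Report structural and content drift in a partner feed before loading."""
    with open(path, newline="") as f:
        rows = list(csv.DictReader(f))
    issues = []
    # Structural drift: columns added or dropped since the agreement was made.
    actual_columns = set(rows[0].keys()) if rows else set()
    if actual_columns != EXPECTED_COLUMNS:
        issues.append(f"column drift: {sorted(actual_columns ^ EXPECTED_COLUMNS)}")
    # Content drift: per-column null rates above the agreed tolerance.
    null_counts = Counter()
    for row in rows:
        for column, value in row.items():
            if value in ("", None):
                null_counts[column] += 1
    for column, nulls in null_counts.items():
        if nulls / len(rows) > MAX_NULL_RATE:
            issues.append(f"{column}: null rate {nulls / len(rows):.1%} over threshold")
    return {"row_count": len(rows), "issues": issues}
```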
Data staging areas coming into a data warehouse
Data coming into a data warehouse is usually staged, that is, stored in its original source format, to loosely couple when the source sends the data and when the warehouse loads it. The staging area also provides an audit trail of what data was sent, which can be used to analyze problems with data found in the warehouse or in reports.
There is usually a staging area located with each of the data sources, as well as a staging area for all data coming into the warehouse.
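One simple way to implement such a staging area is to land every inbound file unchanged and write an audit record alongside it. The sketch below is illustrative only; the directory layout and manifest fields are assumptions, not a prescribed design.

```python
import hashlib
import json
import shutil
from datetime import datetime, timezone
from pathlib import Path

def stage_inbound_file(source_file: Path, source_system: str, staging_root: Path) -> Path:
    """Copy an inbound extract into the staging area in its original format and
    log an audit record (which system sent it, when, and a content hash)."""
    batch_time = datetime.now(timezone.utc)
    batch_dir = staging_root / source_system / batch_time.strftime("%Y%m%dT%H%M%SZ")
    batch_dir.mkdir(parents=True, exist_ok=True)
    staged = batch_dir / source_file.name
    shutil.copy2(source_file, staged)          # keep the original source format
    manifest = {
        "source_system": source_system,
        "file_name": source_file.name,
        "received_at": batch_time.isoformat(),
        "sha256": hashlib.sha256(staged.read_bytes()).hexdigest(),
    }
    (batch_dir / "manifest.json").write_text(json.dumps(manifest, indent=2))
    return staged
```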
Some data warehouse architectures include an operational data store (ODS) for having data
available real time or near real time for analysis and reporting. Real-time data integration
techniques will be described in later sections of this book.
Data warehouse data structure
The data in the data warehouse is usually formatted into a consistent logical structure for
the enterprise, no longer dependent on the structure of the various sources of data. The
structure of data in the data warehouse may be optimized for quick loading of high volumes
of data from the various sources. If some analysis is performed directly on data in the
warehouse, it may also be structured for efficient high-volume access, but usually that is
done in separate data marts and specialized analytical structures in the business intelligence
layer.
Metadata concerning data in the data warehouse is essential to its effective use and is a core part of the data warehouse architecture: a clear understanding of what the data means (business metadata), where it came from, or its lineage (technical metadata), and when things happened (operational metadata). The metadata associated with the data in the warehouse should accompany the data provided to the business intelligence layer for analysis.
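As a small illustration of packaging those three kinds of metadata with a data set handed to the business intelligence layer, the sketch below uses invented field names and values; it is not a standard or a prescribed model.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class DatasetMetadata:
    # Business metadata: what the data means.
    business_name: str
    business_definition: str
    # Technical metadata: where it came from (lineage) and how it is shaped.
    source_systems: list[str]
    lineage: list[str]            # e.g. ["orders_src.csv", "stg_orders", "fact_orders"]
    schema: dict[str, str]        # column name -> data type
    # Operational metadata: when things happened.
    extracted_at: datetime
    loaded_at: datetime
    row_count: int

example = DatasetMetadata(
    business_name="Order facts",
    business_definition="One row per confirmed customer order.",
    source_systems=["order-entry"],
    lineage=["orders_src.csv", "stg_orders", "fact_orders"],
    schema={"order_id": "text", "customer_id": "text", "amount": "numeric"},
    extracted_at=datetime(2013, 4, 1, 2, 0),
    loaded_at=datetime(2013, 4, 1, 4, 30),
    row_count=125_000,
)
```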
Staging from data warehouse to data mart or business intelligence
There may be separate staging areas for data coming out of the data warehouse and into the
business intelligence structures in order to provide loose coupling and audit trails, as
described earlier for data coming into the data warehouse. However, since writing data to
disk and reading from disk (I/O operations) are very slow compared with processing, it may
be deemed more efficient to tightly couple the data warehouse and business intelligence
structures and skip much of the overhead of staging data coming out of the data warehouse
as well as going into the business intelligence structures. An audit trail between the data warehouse and the data marts may be a low priority: what matters more is when the data was last acquired or updated in the data warehouse and in the source application systems, and speed in making the data available for analysis is a larger concern.
Business intelligence layer
The business intelligence layer focuses on storing data efficiently for access and analysis.
Data marts are data structures created to provide a particular part of an organization with the data relevant to its analytical needs, structured for fast access. Data marts may also serve enterprise-wide use while employing specialized structures or technologies.
Extract files from the data warehouse are requested for local user use, for analysis, and for
preparation of reports and presentations. Extract files should not usually be manually
loaded into analytical and reporting systems. Besides the inefficiency of manually
transporting data between systems, the data may be changed in the process between the
data warehouse and the target system, losing the chain of custody information that would
concern an auditor. A more effective and trusted audit trail is created by automatically
feeding data between systems.
Extract files sometimes also need to be passed to external organizations and entities.
As with all data passing out from the data warehouse, metadata fully describing the data
should accompany extract files leaving the organization.
Data from the data warehouse may also be fed into highly specialized reporting systems,
such as for customer statement or regulatory reporting, which may have their own data
structures or may read data directly from the data warehouse.
Data in the business intelligence layer may be accessed using internal or external web solutions, specialized reporting and analytical tools, or generic desktop tools. Appropriate access authority should be enforced, and audit trails should be kept of all data accesses to the data warehouse and business intelligence layers.
The Archive Data Store
Jack E. Olson, in Database Archiving, 2009
14.3.1 Transparent Access to Data from Application Programs
It is often requested that an existing operational application program be able to see archive data without modification. In other words, the data is still treated as active, not meeting the requirement for archiving. This sounds like a great feature to have. However, the cost of providing this feature is very high. The data must be kept in a form that conforms to the view of data from the existing application programs. This severely constrains the options for the underlying archive database. Generally, it means that the archive database must be the same as, or highly similar to, the operational database.
It also implies that the data be readily available. This means that it is not shoved off
to offline storage or to devices with low access performance. If you put the data in a SAN,
you can expect the retrieval time from the SAN for most accesses to result in a timeout of
the application program. To prevent this you have to keep the entire archive up front—not
what you want to do.
It also implies that your application program's calls to the DBMS be intercepted and rerouted to the archive access routines on not-found conditions, or always if the user wants a database union to occur. This type of intercept logic would increase the overhead cost of executing transactions. If the archive database is not kept on the same system, the processing model gets even more complex. No DBA is going to want that type of code running on operational systems.
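The intercept pattern can be sketched at the data access layer as a fallback lookup. The following is a hypothetical illustration of the idea and its cost (every operational miss pays for a second round trip), not a feature of any particular DBMS; the table and connections are invented.

```python
import sqlite3

def fetch_order(order_id: str,
                operational: sqlite3.Connection,
                archive: sqlite3.Connection):
    """Try the operational database first; on a not-found condition, reroute
    the same lookup to the archive store."""
    query = "SELECT * FROM orders WHERE order_id = ?"
    row = operational.execute(query, (order_id,)).fetchone()
    if row is not None:
        return row
    # Not found operationally: reroute to the archive. In a union-style read,
    # both stores would be queried and the results merged instead.
    return archive.execute(query, (order_id,)).fetchone()
```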
You cannot gain system and application independence through this feature. This means that
you keep old systems and applications around long after they are needed, adding to overall
IT cost.
For DBMSs, this feature is possible with a lot of work and clever programming.
For unload and XML stores, it is impossible.
A custom archive DBMS would not support it. It's a bad idea.
Metadata Management for MDM
David Loshin, in Master Data Management, 2009
6.4.1 Critical Data Elements
Of the thousands of data elements that could exist within an organization, how would one distinguish "critical" data elements from the everyday, run-of-the-mill ones? There is a need to define what a critical data element means within the organization, and some examples were provided earlier in this chapter. For an MDM program, the definition of a critical data element should frame how all instances of each conceptual data element are used within the context of each business application. For example, if the master data is used within a purely analytical/reporting scenario, the definition might consider the dependent data elements used for quality analytics and reporting (e.g., "A critical data element is one that is used by one or more external reports.").
On the other hand, if the master data asset is driving operational applications, the definition
might contain details regarding specific operational data use (e.g., “A critical data element
is one that is used to support part of a published business policy or is used to support
regulatory compliance.”). Some other examples define critical data elements as follows:

“… supporting part of a published business policy”

“… contributing to the presentation of values published in one or more external
reports”

“… supporting the organization's regulatory compliance initiatives”

“containing personal information protected under a defined privacy or
confidentiality policy”

“containing critical information about an employee”

“containing critical information about a supplier”

“containing detailed information about a product”

“required for operational decision processing”

“contributing to key performance indicators within an organizational performance
scorecard”
Critical data elements are used for establishing information policy and, consequently,
business policy compliance, and they must be subjected to governance and oversight,
especially in an MDM environment.
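To make the idea concrete, the sketch below flags elements as critical according to rules like those listed above. The element catalog and the rule predicates are invented for illustration; in practice both would come from the organization's metadata repository and governance policies.

```python
# Hypothetical element catalog with the attributes the rules need.
elements = [
    {"name": "customer_tax_id", "used_in_external_reports": True,  "contains_pii": True},
    {"name": "preferred_color", "used_in_external_reports": False, "contains_pii": False},
    {"name": "order_total",     "used_in_external_reports": True,  "contains_pii": False},
]

# Each rule names a policy reason and tests whether it applies to an element.
criticality_rules = {
    "external reporting": lambda e: e["used_in_external_reports"],
    "privacy policy":     lambda e: e["contains_pii"],
}

for element in elements:
    reasons = [rule for rule, applies in criticality_rules.items() if applies(element)]
    if reasons:
        print(f"{element['name']}: critical ({', '.join(reasons)})")
```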
Dimensional Warehouses from Enterprise Models
Charles D. Tupper, in Data Architecture, 2011
Warehouse Architecture
A data warehouse is a database that provides a single, consistent source of management
information for reporting and analysis across the organization (Inmon, 1996; Love,
1994). Data warehousing forces a change in the working relationship between IT departments and users because it offers a self-service model for the business rather than the traditional report-driven model. In a data warehousing environment, end users access data directly using user-friendly query tools rather than relying on reports generated by IT specialists. This reduces user dependence on IT staff to satisfy information needs.
A generic architecture for a data warehouse consists of the following components:

Operational application systems. These are systems that record the details of
business transactions. This is the source of the data required for the decision-support
needs of the business.

External sources. Data warehouses often incorporate data from external sources to
support analysis (purchased statistical data, raw market statistics data).

ETL. These processes extract, transform, and load data into the data warehouse on a regular basis. Data extracted from different sources are consolidated, standardized, and reconciled into a common, consistent format.

Enterprise data warehouse. This is the central source of decision-support data across
the enterprise. The enterprise data warehouse is usually implemented using a
traditional relational DBMS.

User interface layer. This GUI layer provides a common access method against the
enterprise data warehouse. Commonly this is where business intelligence tools are
found.

Persistent dimensionalized data repositories (data marts or, alternatively, cubes). These represent the specialized outlets of the enterprise data warehouse, which provide data in usable form for analysis by end users. Data marts are usually persistent views tailored to the needs of a specific group of users or decision-making tasks. Data marts may be implemented using a traditional relational DBMS or OLAP tools. Cubes are multidimensional arrays that support the same types of analytical queries as data marts.

Users. Users write queries and analyze data stored in data marts using user-friendly
query tools.
Dimensional Modeling
From Ralph Kimball’s (1996) perspective, the data warehousing environment is profoundly
different from the operational one. Methods and techniques used to design operational
databases are inappropriate for designing data warehouses. For this reason, Kimball
proposed a new technique for data modeling specifically for designing data warehouses,
which he called “dimensional modeling” (we touched on this in the previous chapter). The
method was developed based on observations of practice and by vendors who were in the
business of providing data in a user-friendly form to their customers.
Dimensional modeling, although not based on any specific scientific formula or statistical
data occurrence theory, has obviously been very successful in practice. Dimensional
modeling has been adopted as the predominant method for designing data warehouses and
data marts in practice and, as such, represents an important contribution to the discipline of
data modeling and database design.
In early works Kimball posited that modeling in a data warehousing environment is
radically different from modeling in an operational environment and that one should forget
all previous knowledge about entity relationship models:
Entity relation models are a disaster for querying because they cannot be understood by
users and cannot be navigated usefully by DBMS software. Entity relation models cannot
be used as the basis for enterprise data warehouses.
It can be countered that the rigor in relational modeling is equally applicable to the
warehouse context as it is in the operational context and provides a useful basis for
designing both dimensional data warehouses and relational data warehouses.
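As a concrete, minimal illustration of the dimensional approach, the sketch below builds a tiny star schema (one fact table surrounded by dimension tables) and runs a typical slice-and-dice query. The tables, keys, and values are invented for the example and are not drawn from Kimball's case studies.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, product_name TEXT, category TEXT);
    CREATE TABLE dim_date    (date_key INTEGER PRIMARY KEY, calendar_date TEXT, month TEXT);
    CREATE TABLE fact_sales  (product_key INTEGER, date_key INTEGER, quantity INTEGER, revenue REAL);
""")
db.executemany("INSERT INTO dim_product VALUES (?, ?, ?)",
               [(1, "Widget", "Hardware"), (2, "Gadget", "Hardware")])
db.executemany("INSERT INTO dim_date VALUES (?, ?, ?)",
               [(20130401, "2013-04-01", "2013-04"), (20130402, "2013-04-02", "2013-04")])
db.executemany("INSERT INTO fact_sales VALUES (?, ?, ?, ?)",
               [(1, 20130401, 10, 99.0), (2, 20130402, 3, 45.0)])

# A typical dimensional query: revenue sliced by category and month.
for row in db.execute("""
    SELECT p.category, d.month, SUM(f.revenue)
    FROM fact_sales f
    JOIN dim_product p ON p.product_key = f.product_key
    JOIN dim_date d    ON d.date_key    = f.date_key
    GROUP BY p.category, d.month
"""):
    print(row)
```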
Emerging Business Intelligence Trends
David Loshin, in Business Intelligence (Second Edition), 2013
Embedded Predictive Analytic Models
We have already discussed pervasive or integrated BI, in which the results of BI and analytics are fed directly into operational applications. While this approach has gained traction, the field is still ripe for broader adoption, which is why we review it in this chapter on new and emerging techniques.
The predictive models developed using a variety of data and text mining algorithms can be
integrated into business processes to supplement both operational decision-making as well
as strategic analysis, using the patterns that have been revealed to predict future events or
help in achieving a particular goal. For example, customer profiles created through the application of a clustering analysis can be used for real-time classification based on specific demographics or other characteristics. These profiles can be used to recommend opportunities for cross-selling and up-selling, thereby leading to increased revenue.
Embedded predictive models can be used to address all of our value drivers, and are used in
many different scenarios, including customer retention, acquisitions and procurement,
supply chain improvements, fraud modeling, improving forecasting accuracy, clinical
decision-making, credit analysis, and automated underwriting.
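A minimal sketch of the embedding step, assuming scikit-learn is available: customer profiles are learned offline with a clustering algorithm, then the fitted model is called inline by an operational flow to classify a new customer and pick a cross-sell offer. The features and the offer lookup are invented for the example.

```python
import numpy as np
from sklearn.cluster import KMeans

# Offline step: learn customer profiles from history (age, annual spend, visits/month).
history = np.array([
    [23, 1200,  2], [25, 1500,  3], [41, 8200, 10],
    [39, 7900,  9], [63,  400,  1], [60,  650,  1],
])
profiles = KMeans(n_clusters=3, n_init=10, random_state=0).fit(history)

# Hypothetical mapping from discovered segment to a cross-sell offer.
offers = {0: "starter bundle", 1: "premium upgrade", 2: "loyalty renewal"}

def recommend(customer_features):
    """Embedded, real-time step: classify one customer against the learned profiles."""
    segment = int(profiles.predict(np.asarray([customer_features]))[0])
    return offers.get(segment, "no offer")

print(recommend([40, 8000, 8]))
```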
Changes to Data Structures and Policies
Jack E. Olson, in Database Archiving, 2009
Dealing with change is one topic that can distinguish a good database archiving
application from a bad one. Changes to the operational applications impact only the data at
the time changes are made. Beyond that point, the data conforms to the new definition, not
the old. It is as though the old data structures never existed. For the archive, however, data captured under the old structures could persist for many years. A single change event can include a number of changes; it is normal for application developers to collect changes over a period of time and apply them to operational systems all at once. The application may have a periodic change cycle, perhaps quarterly, semiannually, or annually.
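One common way an archive copes with this is to record, with each archived unit, the structure version it was captured under, so that data stored under old structures remains interpretable long after the operational application has moved on. The sketch below only illustrates the idea; the versions and fields are invented.

```python
# Hypothetical structure versions: a change event adds a currency column.
schema_versions = {
    1: ["order_id", "customer", "amount"],
    2: ["order_id", "customer", "amount", "currency"],
}

# Each archived unit carries the version it was captured under.
archive = [
    {"schema_version": 1, "archived_on": "2006-03-31", "row": ["A100", "Smith", "25.00"]},
    {"schema_version": 2, "archived_on": "2009-06-30", "row": ["B200", "Jones", "40.00", "EUR"]},
]

for unit in archive:
    columns = schema_versions[unit["schema_version"]]
    record = dict(zip(columns, unit["row"]))
    print(unit["archived_on"], record)
```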
Bringing It All Together
David Loshin, in The Practitioner's Guide to Data Quality Improvement, 2011
20.3.6 Master Data Management
Application architectures designed to support the operational aspects of each line of
business have required their own information technology support, and all its accoutrements
– data definitions, data dictionaries, table structures, application functionality, and so on, all
defined from the aspect of that business application. The result is that the “enterprise” is
often a mirage, and instead is a collection of many applications referring to multiple,
sometimes disparate sets of data that are intended to represent the same or similar business
concepts.
There is a growing desire to consolidate common data concepts from multiple sources,
analyze that data, and ultimately turn it into actionable knowledge to benefit the common
good. To exploit consolidated information for both operational and analytical processes, an
organization must be able to clearly determine what those commonly used business
concepts are, identify the different ways those concepts are represented, collect and
integrate that data, and then make that data available across the organization. Organizing,
integrating, and sharing enterprise information is intended to create a truly integrated
enterprise, and this is the challenge of what is known as master data management (MDM):
integrating tools, people, and practices to organize an enterprise view of the organization's key business information objects; to govern their quality, use, and synchronization; and to use that unified view of the information to achieve the organization's business objectives.
Master data objects are those core business objects that are used by and shared among the
different applications across the organization, along with their associated metadata,
attributes, definitions, roles, connections, and taxonomies. Master data objects are those key
“things” that we value the most – the things that are logged in our transaction systems,
measured and reported on in our reporting systems, and analyzed in our analytic systems.
A master data system comprising a master data set is a (potentially virtual) registry or index
of uniquely identified entities with their critical data attributes synchronized from the
contributing original data sources and made available for enterprise use. With the proper
governance and oversight, the data in the master data system (or repository, or registry) can
be qualified as a unified and coherent data asset that all applications can rely on for
consistent high quality information.
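As a toy illustration of that consolidation, the sketch below matches records from two contributing applications on a simple key and merges them into uniquely identified master records. The matching rule and the first-non-null survivorship are deliberately naive assumptions; production MDM uses far richer matching and governance.

```python
from dataclasses import dataclass

crm_records = [{"source": "crm", "tax_id": "11-222", "name": "Acme Corp",        "phone": None}]
erp_records = [{"source": "erp", "tax_id": "11-222", "name": "ACME Corporation", "phone": "+1 555 0100"}]

@dataclass
class MasterRecord:
    master_id: str
    tax_id: str
    name: str
    phone: str | None
    contributing_sources: list[str]

def consolidate(records):
    by_key = {}
    for rec in records:
        entry = by_key.setdefault(rec["tax_id"], {"sources": [], "attrs": {}})
        entry["sources"].append(rec["source"])
        for attr, value in rec.items():
            if attr != "source" and value is not None:
                entry["attrs"].setdefault(attr, value)   # first non-null value survives
    return [
        MasterRecord(master_id=f"M{i:04d}", tax_id=key, name=entry["attrs"]["name"],
                     phone=entry["attrs"].get("phone"),
                     contributing_sources=entry["sources"])
        for i, (key, entry) in enumerate(by_key.items(), start=1)
    ]

for master in consolidate(crm_records + erp_records):
    print(master)
```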
Master data management is a collection of data management best practices associated with
both the technical oversight and the governance requirements for facilitating the sharing of
commonly used master data concepts. MDM incorporates policies and procedures to
orchestrate key stakeholders, participants, and business clients in managing business
applications, information management methods, and data management tools. Together,
these methods and tools implement the policies, procedures, services, and infrastructure to
support the capture, integration, and subsequent shared use of accurate, timely, consistent,
and complete master data.
An MDM program is intended to support an organization's business needs by providing
access to consistent views of the uniquely identifiable master data entities across
the operational application infrastructure. Master data management governs the methods,
tools, information, and services to:

Assess the use of commonly used information objects, collections of valid data
values, and explicit and implicit business rules in the range of applications across
the enterprise;

Identify core information objects relevant to business success that are used in
different application data sets that would benefit from centralization;

Instantiate a standardized model for integrating and managing those key information
objects;

Manage collected and discovered metadata as an accessible, browsable resource and
use it to facilitate consolidation;

Collect data from candidate data sources, evaluate how different data instances refer
to the same real-world entities, and create a unique, consolidated view of each real-
world entity;

Provide methods for transparent access to the unified view of real-world data
objects for both existing and newly developed business applications; and

Institute the proper data stewardship and management policies and procedures at the
corporate and line-of-business levels to ensure the high quality of the master data
asset.
Master data management is occupying a growing space in the mind share for data quality
and data governance, and warrants further investigation as a critical component of a data
quality management program.
Disaster Recovery
Kelly C. Bourne, in Application Administrators Handbook, 2014
12.6.2 Access to the DR site
Is access to the DR site restricted? If so, then you don’t want the process of getting access
to delay the process of getting your applications operational. Having answers to the
following questions could prevent significant delays in the event of a disaster:

If a disaster occurs, how will your IT staff get access to the site?

Is there someone at the site 24 × 7 who is authorized and able to distribute ID cards or access cards?

Do your key people already have keys or badges needed to get into the facility?

If not, how long will it take to get access for them?

Does the remainder of your team have access?

If not, how long will it take to get access?
The Archive Discard Component
Jack E. Olson, in Database Archiving, 2009
Amount of data discarded on one execution
Discard volumes have a different dynamic than extract volumes because they are not as
easy to predict.
It is not unusual for the discard policy to be looking for business objects older than any
object in the repository. If you have a 10-year retention requirement and the operational
application was put in place four years ago, the discard program will not find anything for
six years. Knowing this can save a lot of executions.
If you choose to just not execute because of this factor, you could inadvertently forget to
begin execution six years later and end up failing to discard data that should be discarded. It
would be helpful to have a different program that you run on the archive that ages all
business objects and produces an age/count chart from it. The discard policy date could be
added to make this an interesting chart. This could be run periodically, possibly once a
year, and stored in the repository. Such a report helps in anticipating when to start
executing the discard program.
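A minimal sketch of such an ageing report, using made-up capture dates and a 10-year retention period as the example policy: it counts archived business objects by age and marks the ages already past the discard threshold, which shows at a glance when discard executions will start finding work.

```python
from collections import Counter
from datetime import date

RETENTION_YEARS = 10   # example policy value

def age_count_report(archive_dates: list, as_of: date) -> None:
    """Print a simple age/count chart with the discard policy marked."""
    ages = Counter((as_of - d).days // 365 for d in archive_dates)
    for age in sorted(ages):
        marker = "<-- eligible for discard" if age >= RETENTION_YEARS else ""
        print(f"{age:>2} years: {ages[age]:>6} objects {marker}")

# Example run with invented capture dates for the archived business objects.
age_count_report(
    [date(2001, 5, 1)] * 120 + [date(2004, 8, 1)] * 300 + [date(2008, 1, 1)] * 450,
    as_of=date(2009, 1, 1),
)
```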
The opposite dynamic is also possible. The data in operations can include large amounts of
data that were already past the discard date when they were initially archived. This will
result in a larger-than-expected volume of discarded business objects on the first discard
execution. However, it will be considerably smaller than the initial volume received from
extract since that blast of data not only includes all those objects but many more that belong
in the archive but are not old enough to discard.
Discard should reach a stable volume either after the first execution (if older data is present
in the archive) or later (if no data is ready for discard when archiving begins). It should
remain fairly stable from then on, with the typical seasonal variations.
Information Architecture
Rick Sherman, in Business Intelligence Guidebook, 2015
Operational BI versus Analytical BI
One of the ongoing struggles enterprises wrestle with is the clash between operational and management reporting. Operational reporting is tied to specific applications and is typically provided by the application vendor as a pre-built offering. Management reporting spans applications, is typically tied to a DW, and is custom-built using BI tools. The term management refers not only to an enterprise's management staff, but also to business analysts and any other non-operational personnel.
Operational reporting is essential for the business people involved in running the business
on a day-to-day basis. Business transactions and events are captured, monitored, and
reported on by the operational applications. The benefits of relying on the application’s
operational reports are:

Pre-built reporting that does not require an IT project to custom-build the reports or
load the data into a DW

Real-time data access, enabling the business to get the most up-to-date data.

Integration and cooperation with operational processes, to streamline and expedite
these processes.

Data that is presented in the terminology of the underlying application that is well
understood by its business users.
Because enterprises have multiple source systems, management reporting has relied on DWs devised to integrate the source systems and provide the business with the five C's of data: consistent, conformed, comprehensive, clean, and current.
For quite a while these two worlds—DW versus applications—were clearly separate
domains with their own business consumers, IT staffs, budgets, and expertise. There was no
overlap between the people using it, the tools, the data, or the expertise needed. The closest
these groups came to working together was the scheduled feeds from the application to the
DW, but even these were generally handoffs to the BI team, which then went about doing
its own thing.
The BI landscape has changed over time in several ways:

Business people need both operational and management reporting

Application vendors adopted BI tools in their operational reporting offerings (or the
largest application vendors acquired the BI tool vendors themselves)

Some application vendors even built their own DW product offering

The data currency gap (the frequency with which the data is updated) has narrowed or has been eliminated in some instances.
The application vendor landscape has changed with a split between mega-vendors offering
applications with a significant enterprise footprint across many business processes and
specialty application vendors targeting specific business processes or industries. Enterprises
encounter a proliferation of applications, often a mix of on-premise and cloud-based, each
with their own operational reporting environment.
The result of these changes has been the rise of many reporting silos providing overlapping
and inconsistent data when compared to the enterprise BI environment using a DW. With
the 5 C’s of data broken in this landscape of data silos, what is IT to do? Assuming that it is
not acceptable to continue to maintain the status quo of reporting silos, these are the
alternatives:

Shift business users to a specific application reporting silo and supplement the
operational reporting with data from other applications

Shift all reporting to the DW-based BI environment

Blend application-specific and DW BI environments
Shift All Reporting to the Application-Specific Environment
There are two types of application vendors where this scenario gets serious consideration:
1.
Mega-application vendor. This type of application vendor has a significantly
robust application that spans many business processes, offers its own DW
environment, and may even sell one of the major BI tools in the marketplace (a tool
that it acquired when it purchased the original BI vendor).
2.
Application vendor supporting cross-function business process. This type of
application vendor has a significant business footprint in an enterprise, offers
capability to import other application data, has an application development
platform, and likely has cloud-based support.
A key—and erroneous—assumption in this scenario is that the application-specific
reporting platform can be a substitute for the DW or even make it obsolete. Although it’s
possible to bring data into the application environment, you’ve got to ask if it can do an
adequate job or even try to. Nearly every enterprise, no matter how large or small, needs a
significant data integration effort to create data consistency, data integrity, and data quality.
It doesn’t happen overnight and it takes substantial resources with 60–75% of a DW effort
spent on data integration, not creating reports. Even if the application’s data integration
capabilities were up to the task, why would an enterprise shift its focus from supporting
business processes to being a DW? Why would an enterprise migrate its DW environment
to the application when there would be such a large cost, loss in time, and drain on
resources? And if the DW was not replaced, then the enterprise is, in effect, expanding its
multiple reporting silos to multiple DW silos. This makes no sense.
A second erroneous assumption is that all business people need real-time data from an
application. Certainly many applications, such as call centers and inventory management
operations, truly benefited from the real-time information. But then everyone wanted real-
time data. Operational BI was seen as the way to get it.
The reality is that most business analysis does not need real-time data. In fact, real-time
data would cause problems or create noise, e.g., inconsistent data results, that would have
to be filtered out. Much of what business people do is examine performance metrics, trend
reports, and exception reporting. Most are looking at daily, weekly, monthly, and year-to-
date analysis, not hourly analysis or trends. There are some industries where that would be
useful, but with most it doesn’t matter what was sold between 9:05 a.m. and 9:30 a.m.
today.
Remember, you build performance management and BI solutions to satisfy a business need.
Real-time BI often is suggested because it can be done technically, not because of a
business need. Do you really want to spend the resources, time, and budget for something
the business doesn’t need?
Shift All Reporting to the DW-Based BI Environment
Although there are many DW zealots who state that all reporting should come from the DW, there are many practical reasons why this, like the shift to exclusively application-based reporting discussed previously, should not be pursued.
With advances in data integration capabilities and productivity, enterprise BI environments
can capture data near-time or even real-time if there is a business need. This removes one
of the technical constraints on the DW’s ability to provide operational reporting, but there
are other considerations.
The primary reason to not have a DW assume all operational reporting is simply, “If it ain’t
broke, don’t fix it.” If the application’s pre-built reporting is being used and relied on,
particularly if it has been embedded in an enterprise’s business processes, you should not
pursue a replacement project with the associated expense, time, and resources to duplicate
something that is working. This would result in a lost opportunity to invest those resources
into expanding an enterprise’s overall BI capability and usage.
In addition, the application vendor will likely be expanding, updating, and maintaining its
operational reporting offering, which means that shifting this reporting to the DW would be
a continuing drain to duplicate what the vendor is providing.
Blend Application-Specific and DW BI Environments
With the understanding that neither the DW nor the application-based operational reporting
will be sufficient unto themselves, the solution is to blend these together. Although an
enterprise could just keep each environment in its own silo, it is very likely that business
people will not see consistent information in each silo. In addition, with totally separate
silos, the business consumers need to know which one to use under what conditions and
their experience within each silo will be different. Leaving the silos as is will be a drain on
both IT and business productivity.
The recommendations for creating a blended environment, as depicted in Figure 5.7, are to:
FIGURE 5.7.  Blended BI environment.


Create a data architecture for your business users that includes the enterprise applications, DWs, data marts, cubes, and even spreadsheets that are needed to enable both operational and management reporting (see Chapter 6).

Create a technical architecture that enables data to be transformed and moved in whatever manner is appropriate for the business purpose (see Chapter 7). There are many times when data should be queried from the enterprise application directly rather than insisting that the data be moved to a DW before the business people can access it. Data virtualization, for example, enables querying across disparate data sources such as enterprise applications, where in the past the data had to be physically moved into a common database using an ETL tool (see the sketch after this list).

Create an information architecture that enables a BI portfolio for your business
people that spans operational and management reporting. Business people do not
need to understand the differences. They simply need to perform their analysis
regardless of where the data is located in your data architecture.

Leverage common BI tools across your BI portfolio. Business people should be
able to use the same BI tools for their reporting and analysis, regardless of where
the data comes from. They should not need to change their behavior based on where
in the data portfolio their analysis needs are met. Simplify their lives by letting them
go to one place and use one set of tools.
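As a minimal sketch of the query-in-place idea referenced in the data virtualization point above: one business question is answered by reading an enterprise application's database directly and combining the result with data already conformed in the DW, without first moving the application data. The connections, tables, and columns are invented for the example; real data virtualization products do this declaratively across many sources.

```python
import sqlite3

def revenue_with_open_pipeline(crm: sqlite3.Connection, dw: sqlite3.Connection):
    """Join, in memory, open CRM opportunities (queried in place) with lifetime
    revenue already conformed in the warehouse."""
    open_opportunities = dict(crm.execute(
        "SELECT customer_id, SUM(value) FROM opportunities "
        "WHERE status = 'open' GROUP BY customer_id"))
    for customer_id, lifetime_revenue in dw.execute(
            "SELECT customer_id, SUM(revenue) FROM fact_sales GROUP BY customer_id"):
        yield customer_id, lifetime_revenue, open_opportunities.get(customer_id, 0.0)
```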
There should not be a business distinction between operational and analytical BI when it's properly designed into your information, data, and technology architectures.
