
TDWI RESEARCH

TDWI CHECKLIST REPORT

Eight Tips for Modernizing a Data Warehouse
By Philip Russom

Co-sponsored by:

tdwi.org

MAY 2015


TABLE OF CONTENTS

FOREWORD

NUMBER ONE: Modernize your data warehouse environment to leverage new data and big data

NUMBER TWO: Support the data needs of new analytics with a modern warehouse and other integrated data platforms

NUMBER THREE: Re-architect the data warehouse and its environment as you modernize

NUMBER FOUR: Consider Hadoop an extension of the modern warehouse

NUMBER FIVE: Modernize ETL, not just the core warehouse

NUMBER SIX: Accelerate the business closer to real-time operations as you modernize the data warehouse and related systems

NUMBER SEVEN: Comply with external regulations and internal policies as you handle data during modernization

NUMBER EIGHT: Apply modern economic criteria to selecting and using data platforms

ABOUT OUR SPONSORS

ABOUT THE AUTHOR

ABOUT TDWI RESEARCH

ABOUT TDWI CHECKLIST REPORTS
555 S Renton Village Place, Ste. 700
Renton, WA 98057-3295
T 425.277.9126
F 425.687.2842
E info@tdwi.org

tdwi.org


© 2015 by TDWI, a division of 1105 Media, Inc. All rights reserved. Reproductions in whole or in
part are prohibited except by written permission. E-mail requests or feedback to info@tdwi.org.
Product and company names mentioned herein may be trademarks and/or registered trademarks of
their respective companies.



FOREWORD

No matter the vintage or sophistication of your organization's data
warehouse (DW) and the environment around it, it probably needs
to be modernized. DW modernization takes many forms. Common
scenarios range from software and hardware server upgrades to
the periodic addition of new data subjects, sources, tables, and
dimensions. As data types and data velocities continue to diversify,
many users are likewise diversifying their software portfolios to
include tools and data platforms built for new and big data. A few
organizations are even decommissioning current DW platforms to
replace them with modern ones optimized for today's requirements
in big data, analytics, real time, and cost control. No matter what
modernization strategy is in play, all require significant adjustments
to the logical and systems architectures of the extended data
warehouse environment.
Most of the trends driving the need for data warehouse
modernization boil down to four broad issues:

1. Organizations demand business value from big data. In
other words, users are not content to merely manage big
data and other valuable data from new sources, such as Web
applications, machines, devices, social media, and the Internet
of things. Because big data and new data tend to be exotic in
structure and massive in volume, users need new platforms
that scale with all data types if they are to achieve business
value.

2. The age of analytics is here. Many firms are aggressively
adopting a wide variety of analytic methods so they can
compete on analytics and understand evolving customers,
markets, and business processes. There is a movement from
analyst intuition and statistics to empirical, data-science-driven
insights. Furthermore, today's consensus says that the
primary path to big data's business value is through so-called
advanced forms of analytics, based on technologies for
mining, predictions, statistics, and natural language processing
(NLP). Each analytic technology has unique data requirements,
and DWs must modernize to satisfy all of them.

3. New challenges for real-time data. Technologies and
practices for real-time data have existed and been successfully
used for years. Yet many organizations are behind in this area,
so it's a priority for their data warehouse modernization efforts.
Even organizations that have succeeded with real-time data
warehousing and similar techniques will now need to refresh
their solutions so that real-time operations scale to exponential
data volumes, streams, and greater numbers of concurrent
users and applications. Furthermore, real-time technologies
must adapt to a wider range of data types, including schema-free
and evolving ones.

4. Open source software (OSS) is now ensconced in data
warehousing. Ten years ago, Linux was the only OSS product
commonly found in the technology stack for DWs, BI, analytics,
and data management. Today, TDWI regularly encounters OSS
products for reporting, analytics, data integration, and big data
management. This is because OSS has reached a new level of
functional maturity while still being economically desirable.
A growing number of user organizations are eager to leverage
both characteristics.

To help user organizations prepare, this TDWI Checklist Report
canvasses eight of the leading DW modernization scenarios,
discussing many of the new product types, functionality, and
user best practices (as well as the business case and technology
strengths) of each.


NUMBER ONE
MODERNIZE YOUR DATA WAREHOUSE ENVIRONMENT TO LEVERAGE NEW DATA AND BIG DATA

A founding principle of data warehousing is that user organizations


should repurpose data from the enterprise and other sources to gain
additional insights and guide decisions. In that spirit, organizations
are grappling with new data types and sources and how to capture
and manage these information assets, plus how to leverage them
for business advantage. For example:
Web logs. A common starting point for leveraging big data is to
assemble logs from Web servers and other Internet applications, then
sessionize and analyze the clickstream and shopping cart data they
contain to understand website visitor behavior and products of affinity
in an e-commerce context.
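As a rough sketch of what sessionizing clickstream records means in practice, the following stands alone in Python; the 30-minute timeout, field layout, and sample clicks are illustrative assumptions, not details from this report:

```python
from datetime import datetime, timedelta

SESSION_TIMEOUT = timedelta(minutes=30)  # a common default; an assumption here

def sessionize(events):
    """Group (visitor_id, timestamp, url) clicks into sessions per visitor.

    A new session starts when the gap since the visitor's previous
    click exceeds SESSION_TIMEOUT.
    """
    sessions = {}   # visitor_id -> list of sessions (each a list of events)
    last_seen = {}  # visitor_id -> timestamp of that visitor's previous click
    for visitor, ts, url in sorted(events, key=lambda e: e[1]):
        if visitor not in last_seen or ts - last_seen[visitor] > SESSION_TIMEOUT:
            sessions.setdefault(visitor, []).append([])  # open a new session
        sessions[visitor][-1].append((ts, url))
        last_seen[visitor] = ts
    return sessions

clicks = [
    ("v1", datetime(2015, 5, 1, 9, 0), "/home"),
    ("v1", datetime(2015, 5, 1, 9, 5), "/product/42"),
    ("v1", datetime(2015, 5, 1, 11, 0), "/cart"),  # >30 min gap: new session
    ("v2", datetime(2015, 5, 1, 9, 2), "/home"),
]
by_visitor = sessionize(clicks)
print(len(by_visitor["v1"]))  # 2 sessions for visitor v1
```

In a real deployment this logic would run at scale over raw Web server logs, but the grouping rule itself is exactly this simple.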
Industry-specific big data. Valuable data sets and analytics can be
assembled from call detail records (CDRs) in telecommunications; RFID
in retail, manufacturing, and other product-oriented industries; and
sensor data from robots in manufacturing and vehicles in logistics.
Human language and other text as big data. Tools based on natural
language processing, search, and text analytics provide visibility
into text-laden business processes, as in the claims process in
insurance, medical records in healthcare, and call center or help
desk applications in any industry. The killer app of human language
data is sentiment analytics, which has become common in customer-oriented
businesses, using both enterprise and social media big data.
Multi-structured data. Partnering firms that work together through a
supply chain often exchange information via XML and JSON documents,
which include a mixture of structured data, hierarchies, text, and other
elements. When processed and analyzed properly, these help quantify
profitable partners, supply quality, and supply chain efficiencies.
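To make the multi-structured idea concrete, here is a small Python sketch that parses invented JSON shipment documents and derives a per-partner defect rate; the document shape, partner names, and figures are all hypothetical:

```python
import json

# Hypothetical shipment documents of the kind supply chain partners exchange;
# each mixes structured fields, a nested hierarchy of lines, and free text.
docs = [
    '{"partner": "Acme", "lines": [{"sku": "A1", "qty": 100, "defects": 2}], "notes": "late truck"}',
    '{"partner": "Acme", "lines": [{"sku": "B7", "qty": 50, "defects": 0}], "notes": ""}',
    '{"partner": "Bolt", "lines": [{"sku": "A1", "qty": 200, "defects": 9}], "notes": "damp pallets"}',
]

quality = {}  # partner -> (units shipped, defective units)
for raw in docs:
    doc = json.loads(raw)
    shipped = sum(line["qty"] for line in doc["lines"])
    defects = sum(line["defects"] for line in doc["lines"])
    s, d = quality.get(doc["partner"], (0, 0))
    quality[doc["partner"]] = (s + shipped, d + defects)

for partner, (shipped, defects) in sorted(quality.items()):
    print(f"{partner}: defect rate {defects / shipped:.1%}")
```

The structured fields roll up into supply-quality metrics, while the free-text notes would feed the text analytics discussed above.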
Managing and leveraging these new data types and sources is
worthwhile because of their business value. However, users are
challenged by the newness of the data, the massive volume of
many new data sets, the wide range of data structures, and the
streaming nature of some sources. The problem is further compounded
because most vendor platforms and user designs for traditional data
warehouses were originally designed for structured data alone or just
for relational data. Because many manifestations of new big data are
not relational (or even structured in any way), many users are asking:
How do we modernize our DW so that we both preserve our traditional
investment and embrace new types and sources of data?
Many users choose to reserve their core DW for the relational data that
goes into standard reports, dashboards, performance management,
and OLAP. For new big data, users are deploying specialized platforms
built for new data types, and they are integrating the new platforms
with the core DW and related systems. Specialized platforms include
those based on column stores and appliances, plus open source
Hadoop and NoSQL databases. Some data warehouse platform vendors
have incorporated native support for semi-structured data types
such as XML and JSONwith the relational environment to enable
tight integration between semi-structured and structured data types.
Given the real-world limitations of modernizing a DW that's tightly
wedded to the relational paradigm, complementing the relational DW
with other data platforms is a viable strategy for DW modernization.
Even so, some organizations prefer to replace the old DW platform with
a different platform that's more broadly suited to the extreme diversity
of data we're witnessing today, even though rip-and-replace is time-consuming
and disruptive for the business.
To quantify users' efforts with data warehouse modernization, a
recent TDWI survey asked: Which of the following best describes
your organization's strategy for evolving [or modernizing] your DW
environment and its architecture, relative to big data? Most survey
respondents plan to extend an existing DW (41%); the assumption is
that the DW platform in place is capable of handling a broad range of
data types and their workloads.
However, a few will deploy new data platforms (25%); they assume
these specialized platforms complement the core DW without replacing
it. Finally, 29% of respondents have no strategy for DW modernization
or addressing big data, which is not a good idea given the upsurge in
new big data and other modernization requirements.

Figure 1. Strategies for data warehouse modernization:
Extend existing data warehouse to accommodate big data and other new requirements: 41%
Deploy new data management systems, specifically for big data, analytics, real time, etc.: 25%
No strategy, though we need one: 23%
No strategy because we don't need one: 6%
Other: 5%


1. Figure 1 in this report is based on Figure 11 in the 2014 TDWI Best Practices Report Evolving Data Warehouse Architectures in the Age of Big Data, available for download at tdwi.org.



NUMBER TWO
SUPPORT THE DATA NEEDS OF NEW ANALYTICS WITH A MODERN WAREHOUSE AND OTHER INTEGRATED DATA PLATFORMS
We say "analytics" as if it were a single practice or technology. In
reality, there are many approaches to analytics, and there are many
enabling technologies, including mining, clustering, statistics,
predictive algorithms, SQL, hierarchies, dimensions, visualization,
and a wide array of natural language processing (NLP) techniques.
A ramification of the diversity of analytics is that the requirements
for data to be analyzed vary tremendously. Some analytic methods
demand relational data; others need some other structure. This,
in turn, complicates the modernization of a data warehouse that
must supply data for multiple analytic approaches. Again, given
the diversity of analytic data, many users choose to deploy multiple
purpose-built platforms, instead of expecting a relational warehouse
to supply all data types. Here's a rundown of data structures
required by various analytic methods:
Data exploration and discovery. Many analytic methods begin with
a data analyst exploring data as a prelude to analysis, reporting,
and visualization. Although it's possible to explore data residing
on many platforms, a few organizations have relocated data to be
explored into data lakes, data vaults, and enterprise data hubs,
typically on Hadoop, a large configuration of an MPP database, or a
hybrid environment that supports elements of both.
Large data samples. Some analytic methods (for mining, statistics,
and clustering) work best with data samples of many terabytes or
petabytes. Many users house these on large MPP configurations,
but the trend is toward Hadoop integrated with a relational MPP
database.
Relational data. TDWI surveys show that after OLAP, the most
common form of analytics is so-called complex SQL or extreme SQL.
This involves hundreds of lines of SQL because data access, data
models, data transformations, and other elements are expressed
in SQL code instead of being handled elsewhere. For this form of
analytics, relational DBMSs are the obvious choice today, although
the progress of SQL on Hadoop may change this.
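As a toy illustration of analytics expressed directly in SQL, the snippet below uses Python's built-in sqlite3 as a stand-in for a warehouse DBMS; the schema and figures are invented, and a real "extreme SQL" job would run to hundreds of lines in the same style:

```python
import sqlite3

con = sqlite3.connect(":memory:")  # sqlite stands in for a warehouse DBMS
con.executescript("""
CREATE TABLE sales (region TEXT, amount REAL);
INSERT INTO sales VALUES ('east', 100), ('east', 300), ('west', 50), ('west', 250);
""")

# The transformation and the analysis both live in SQL itself: a CTE
# derives per-region totals, and the outer query compares each region
# against the average instead of doing that work in application code.
query = """
WITH totals AS (
    SELECT region, SUM(amount) AS total FROM sales GROUP BY region
)
SELECT region, total,
       total - (SELECT AVG(total) FROM totals) AS vs_avg
FROM totals ORDER BY total DESC;
"""
for region, total, vs_avg in con.execute(query):
    print(region, total, vs_avg)
```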

Hierarchies. Hierarchical business structures are all around us, in a


bill of materials, a chart of accounts, and XML or JSON documents.
Furthermore, some tools for data mining, text analytics, and
visualization produce hierarchies. Vendor brands of relational DBMSs
vary in their abilities to successfully manage hierarchies. The trend
is toward Hadoop.
File-based data. Significant new big data is captured in log files,
such as those generated by Web servers, enterprise applications,
machines (sensors, robots, and devices), and when streaming data
is captured. Hadoop was designed for logs and other file-based
data, so it's a natural choice.
Multimedia data. Some organizations need to store, manage, and
analyze audio and video files, preferably in an active archive, which
Hadoop can enable.
Textual documents. For the analytic methods of sentiment analysis,
entity extraction, text mining, and other forms of NLP, the human
language and other forms of text they operate on are often file-based. For these applications, Hadoop is coming on strong as the
preferred storage and analytic processing platform.
Set-based and algorithmic approaches to analysis. Set-based
analytics usually entails relational techniques, namely SQL, tables,
keys, dimensions, etc.; optimizing and parallelizing operations
with these is easily done in a relational database environment.
Algorithmic analytics (sometimes called procedural analytics)
varies considerably, but a common example is the row-over-row
comparisons made in graph or time-series analyses. All forms of
algorithmic analysis optimize well in Hadoop.
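A minimal Python sketch of the row-over-row style of algorithmic analysis described above; the readings and the threshold are invented:

```python
# Each reading is compared with its predecessor: something set-based SQL
# expresses awkwardly but a procedural scan handles naturally.
readings = [10.0, 10.2, 10.1, 14.8, 14.9, 10.3]

def spikes(series, threshold=3.0):
    """Return the indexes where a value jumps more than `threshold`
    relative to the previous row."""
    out = []
    for i in range(1, len(series)):
        if abs(series[i] - series[i - 1]) > threshold:
            out.append(i)
    return out

print(spikes(readings))  # [3, 5]
```

Because every row depends only on its neighbor, this kind of scan parallelizes well across partitions of a time series, which is why such workloads optimize well in Hadoop.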

Dimensional models. A true data warehouse will include


dimensional models, typically to support online analytic processing
(OLAP). Hence, the relational DW continues to be the first choice
for dimensional analytics, followed by relational appliances and
columnar databases.



NUMBER THREE
RE-ARCHITECT THE DATA WAREHOUSE AND ITS ENVIRONMENT AS YOU MODERNIZE

Data warehouse modernization faces a perfect storm of


requirements: supporting new data, expanding analytics, coming
closer to true real-time operations, containing costs, and planning
capacity, among others. One way to satisfy diverse requirements is
to diversify the software and hardware portfolio of the DW by adding
more tools and platforms to it. That's exactly what roughly half of
organizations are doing.

practice many users need to apply during data exploration and


analytics. Admittedly, the additional platforms complicate the
architecture, but BI/DW professionals have dealt with a complex
technology stack for decades, so they are well-equipped for multiplatform DWEs. In addition, a number of data platform vendors are
extending their tools to simplify the orchestration, ingestion, and
consumption of data, regardless of where the data is persisted.

Many user organizations are evolving their mature enterprise


data warehouses (EDWs) into multi-platform data warehouse
environments (DWEs). To put it in historical perspective, the
technology stack for BI and DW has always had multiple tools and
platforms, including tools for reporting, analytics, and integration,
as well as database management systems (DBMSs) for the DW, data
marts, cubes, and operational data stores (ODSs).

A DWE enables a workload-centric architecture that gives


users more options. For example, a DWE assumes that some
workloads and their data are best offloaded from the core DW and
taken to a platform more suited to them. This includes workloads
and data for algorithmic analytics, extreme SQL-based analytics,
multi-structured data, massive big data, and real time. This
modernization strategy frees up capacity on the core DW so it can be
reallocated to expanding DW-specific data and workloads.

We say the warehouse or the EDW as if its one monolithic entity,


although for many organizations its long been a collection of moreor-less integrated tools, data platforms, and data sets. Rearranging
the EDW acronym to DWE acknowledges the extreme degree the
multi-platform DW and BI technology stack has achieved in recent
years, and its not just the DWE. Data management in other areas of
the enterprise has attained a similar extreme of platform diversity.
The current extreme of the multi-platform DWE has architectural
ramifications:
New data platforms enable new practices that complement the
core DW without replacing it. Thats because the DW is still the
best platform for the aggregated, standardized, and documented
data that goes into standard reports, dashboards, performance
management, operational analytics, and OLAP. Instead of replacing
it, the new platforms complement the warehouse because they are
optimized for workloads that manage, process, and analyze data
thats new, big, unstructured, exotic, or real time. Also, new data
platforms are better suited to the early ingestion, later processing

15%

Central EDW with a few


additional data platforms

37%

To quantify the trend toward multi-platform data warehouse


environments (DWEs), a recent TDWI survey asked: Which of the
following best describes your extended data warehouse environment
today? (See Figure 2.) Pure, central, monolithic EDWs are
relatively rare (15%, far left). Conversely, environments without a
DW are equally rare (15%, far right). The majority of DWs coexist
successfully with other platforms in a mixed environment (68%,
middle three segments of the chart). Even so, the degree of diversity
varies from a few additional platforms to many.

Central EDW
with many
additional data
platforms

16%

Many workloadspecific data


platforms; EDW
is present but not
the center

15%

No true
EDW; many
workloadspecific data
platforms
instead

15%

Other

2%

DWE

EDW

Central
monolithic EDW
with no other
data platforms

Note that the leading benefit of the workload-centric DWE is that it


gives users options: they can match a given data set or workload with
a platform thats the best technical fit or the most cost-effective.
In that context, modern organizations develop metrics for total cost,
ROI, functionality, performance, ownership, and other data platform
characteristics so that decisions about data platform usage are
enlightened by the full range of platform characteristics, not just
technical capabilities.

Figure 2. Evolving from the EDW to the modern DWE.2

Figure 2 in this report is based on Figure 10 in the 2014 TDWI Best Practices Report Evolving Data Warehouse Architectures in the Age of Big Data, available for download at tdwi.org.



NUMBER FOUR
CONSIDER HADOOP AN EXTENSION OF THE MODERN WAREHOUSE

As we just saw, the multi-platform data warehouse environment


(DWE) is both a trend and a strategy for data warehouse
modernization. Among the new platforms proliferating in DWEs,
Hadoop is coming on strong for several reasons:3
Open source software (OSS) has recently achieved a higher
level of functional maturity, in general, across all types of OSS.
This makes Hadoop and other OSS products more attractive for
demanding enterprise uses.
A compelling balance of cost and performance is struck by
Hadoop. Vendor distributions of Hadoop add enterprise functions
required for enterprise use (security, administration, maintenance,
high availability, disaster recovery, query, etc.) but are more
affordable than comparable licenses for enterprise software.
Furthermore, Hadoop is proven to perform and scale linearly, even
when deployed on the cheapest commodity hardware.
Data-type diversity leads many users to Hadoop. Theoretically,
any data you can put in a file can be handled by the Hadoop
Distributed File System. This empowers user organizations to finally
get full business value from unstructured and semi-structured
data.
Computational power for advanced analytics is the true value
proposition for Hadoop. Hadoop's renowned talent for storing
massive volumes of highly diverse data is merely a foundation for
computational analytics. This also makes Hadoop a complement
to the set-based analytics performed elsewhere in the DWE with
OLAP, SQL, and relational techniques.
Hadoop complements and extends other platforms without
replacing them. This adds years of productive use, new functionality,
and greater scale to traditional investments in data warehouses,
reporting tools, analytic tools, and data integration tools.
Early adopters and others have been using Hadoop integrated with
a DW for a few years now. From their successful experiences, we see
that there are a number of low-risk but high-value use cases that are
appropriate to users wishing to introduce Hadoop into their DWEs:

an ODS migrate easily and perform well with little tweaking once
in Hadoop. In a similar trend, some users are working toward
an enterprise data hub (EDH), which extends the capabilities of
operational data stores, to bring more analytic workloads to larger
volumes of diverse data.
Data staging. Hadoop was designed for early ingestion, later
processing data management best practices. Hence, it adapts
well to data landing, data staging, and the transformational
processing of data that usually accompanies such practices.
Source data archiving. Its impossible to foresee all the ways
that source data will need to be repurposed for new analytic
applications in the future. The current practice is to retain raw,
extracted data with all its original details. Much of the expensive
storage capacity of EDWs is burned up by large archives of source
data; Hadoop can store and process this data just as well, but at a
fraction of the cost. Unlike old-fashioned archives that depend on
offline media such as magnetic tapes and optical disks, a Hadoopbased archive is online, queryable, and searchable, so users get
daily business value from it without time-consuming data-restore
processes.
Computational analytics. Valuable computational analytics
performed by Hadoop users today includes website behavior
analysis, sentiment analysis, clustering for customer base
segments, and many applications of statistical or mining
techniques with large volumes of diverse data.
ETL/ELT offload. Just as users offload data and analytic
workloads from the core DW to Hadoop, they also offload jobs for
extract, transform, and load (ETL). The catch is that some ETL or
ELT jobs are inherently relational or set-based because they involve
complex table joins or depend on advanced SQL functions; such
jobs are best controlled by a data integration tool and pushed
down into a relational DBMS. However, other ETL jobs count entity
occurrences or perform algorithmic processing but on a massive
scale, which is at the core of Hadoops design.

Operational data stores (ODSs). TDWI has found users who have
migrated ODSs from relational DBMSs to Hadoop, typically for use
with Hive and HBase, sometimes MapReduce and Pig. They report
that the straightforward record or relational data structures of
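The "count entity occurrences" style of job that offloads well can be sketched as a map and reduce pair; this is plain Python with invented log lines, not actual Hadoop code:

```python
from collections import Counter
from itertools import chain

# Invented log lines; in practice these would be files stored in HDFS.
logs = [
    "GET /home 200",
    "GET /cart 500",
    "GET /home 200",
]

def map_phase(line):
    """Emit (status_code, 1) pairs, as a Hadoop mapper would."""
    return [(line.split()[-1], 1)]

def reduce_phase(pairs):
    """Sum the counts per key, as a reducer would."""
    totals = Counter()
    for key, n in pairs:
        totals[key] += n
    return dict(totals)

counts = reduce_phase(chain.from_iterable(map_phase(l) for l in logs))
print(counts)  # {'200': 2, '500': 1}
```

Because each map call touches only one line and the reduce is a simple sum, the same job distributes across a cluster with no change to the logic, which is why this class of ETL scales so naturally on Hadoop.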

3. Readers unfamiliar with Hadoop may wish to read the TDWI Best Practices Reports Integrating Hadoop into Business Intelligence and Data Warehousing and Hadoop for the Enterprise, available for download at tdwi.org.



NUMBER FIVE
MODERNIZE ETL, NOT JUST THE CORE WAREHOUSE

Data warehouse modernization is not limited to the warehouse per
se. A modernization strategy may be needed for the many tools and
platforms that interface with the DW and other data platforms in
the DWE. That's potentially a long list of tools, so let's focus on
those for data integration (DI) and extract, transform, and load
(ETL).

Shuffling data in a modern DWE. As users add more types of
data platforms to their DW environments, they almost always
need to move data around to relocate it on new platforms that are
best suited to given data sets. Hence, early in DW modernization
initiatives, users must plan a number of data migrations,
consolidations, collocations, and data workload balancing. These
are typically done with a variety of DI tools, including those for ETL
or replication.

Data integration infrastructure for the modern DWE. Users
have always needed a solid data integration architecture to cope
with complex data flows and multiple tools in the BI/DW technology
stack. A modern DWE takes that situation to a new extreme, and
a DWE assumes many complex multi-platform data flows. Hence,
data integration infrastructure is a critical success factor for daily
operations in a DWE.

Adapting to new ETL practices. Traditional data warehousing
practices use ETL to improve data before loading a DW. Users with
ample capacity on their DWs may push down some processing
into the DW, which is known as ELT. A new variation on these
practices ingests extracted data into the target data platform as
early as possible, then processes the data for specific purposes as
late as possible. Called "early ingestion, late processing," this has
become a standard practice with new big data, especially when
Hadoop is the target.

Modernizing metadata management. This is especially
challenging with schema-free new data. Instead of developing
metadata a priori (as is the case with most DW practices today),
modern tools for Hadoop can deduce metadata at runtime from
a wide range of data structures, empowering a user to develop
metadata quickly as data is explored, discovered, and analyzed.
The same tools can also detect evolving data structures, track data
lineage, enable search, and update statistics and heuristics about
specified data.

NUMBER SIX
ACCELERATE THE BUSINESS CLOSER TO REAL-TIME OPERATIONS AS YOU MODERNIZE THE DATA WAREHOUSE AND RELATED SYSTEMS

For decades, BI professionals have pushed refreshing and delivering
reports and analyses closer to real time. Today, a number of common
BI practices handle data in near real time (minutes or hours), including
operational BI, dashboarding, and metrics-driven performance
management. These practices enable managers to make tactical and
operational decisions based on very fresh information.

However, for some fast-paced, time-sensitive business processes,
near real time (also known as "near time") isn't fast enough. They need
true real time, where data is handled within seconds, preferably
microseconds. Examples include applications for financial trading
systems, business activity monitoring, utility-grid monitoring,
e-commerce product recommendations, and facility surveillance.

For user organizations needing to modernize the DWE to handle data
in near time or real time, many technologies are available today and
therefore should be considered. The list includes data federation
and virtualization, data replication and synchronization, intraday
micro batches, columnar DBMSs, DW appliances, MPP computing
architectures, elastic clouds, in-database analytics, in-memory
functions, and solid-state drives.4 Note that the bar has been raised
on these; they must operate in various short time frames (sometimes
called "right times") and they must also operate on a wider range of
data structures in unprecedented volumes.

Complex event processing (CEP) for streaming data. One form of
new big data is streaming data. Data streams into an organization
more or less continuously as a series of data records, each describing
a business event. For example, streams come online when users add
sensors to their machines, products, vehicles, and mobile devices,
plus turn on logging in Web or enterprise applications. Streaming data
is captured, triaged, and processed to determine a reaction; then an
automated response is executed by software or a user is alerted, all
within seconds or milliseconds. Standalone CEP tools have arisen to
handle streams, and users are adding CEP tools to their DWEs as they
modernize for true real-time operations.

Hadoop for streaming data. Early versions of Hadoop lacked
near-time and real-time capabilities. This situation has improved
considerably with the introduction of open source projects for
capturing and analyzing streaming data (such as Samza, Spark, and
Storm). These promise to handle both the speed of real time and the
massive data volumes we expect in Hadoop. TDWI anticipates that
(Continues)

4. For an in-depth examination of real-time operations, see the 2014 TDWI Best Practices Report Real-Time Data, BI, and Analytics, available on tdwi.org.



NUMBER SEVEN
COMPLY WITH EXTERNAL REGULATIONS AND INTERNAL POLICIES AS YOU HANDLE DATA DURING MODERNIZATION
(Continued)
Hadoop will become a preferred real-time platform because of
its low cost (as compared to commercial CEP platforms) and its
massive storage capabilities. After all, streaming data adds up to
large volumes in a hurry.
Interactive SQL on Hadoop. The many users of HiveQL with
Hive and HBase attest to the value of these tools. Yet data
management professionals are calling for better support of
standard SQL on Hadoop so they can leverage their SQL skills and
their SQL-based tools. Likewise, data analysts need near-real-time
query responses in support of analytic practices such as data
exploration and ad hoc queries. The open source projects Drill and
Impala provide these and other functions. In addition, some vendor
distributions of Hadoop support file-system enhancements for fast
ingestion of data streams, so these are available immediately for
both analytic and operational workloads.
Streaming ETL on Hadoop. Hadoop's capabilities designed for
handling and analyzing streaming data can also be used for
streaming ETL, which can aggregate, transform, and otherwise
process data as it arrives. Streaming ETL avoids the overhead and
latency of applying structure before load time, and by accelerating
the ETL process, downstream decision making and other business
processes are greatly accelerated.
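A minimal Python sketch of the streaming ETL idea, transforming and aggregating each record as it arrives rather than after a bulk load; the stream contents and field names are invented:

```python
def stream():
    """Stand-in for an arriving event stream; values are invented."""
    yield {"store": "s1", "sale": 20.0}
    yield {"store": "s2", "sale": 5.0}
    yield {"store": "s1", "sale": 7.5}

def streaming_etl(source):
    """Transform and aggregate every record in flight, so running totals
    are current the moment the stream pauses: no post-load batch step."""
    totals = {}
    for record in source:
        store = record["store"].upper()  # transform as the record arrives
        totals[store] = totals.get(store, 0.0) + record["sale"]
        yield store, totals[store]       # running total available downstream

for store, running_total in streaming_etl(stream()):
    print(store, running_total)
```

The aggregation happens before anything lands in a table, which is exactly how streaming ETL removes the latency of applying structure at load time.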

Data warehouse modernization is an opportunity to create or improve


data governance best practices, plus related practices for data
standards and security.
New big data needs data governance (DG), as would any data set. DW
modernization usually involves new data, and each new source should
be certified per established compliance and governance policies
prior to use. Because the policies and standards created by most
data governance committees are designed for structured data and
traditional platforms, data types and sources that are new to your
DWE may need new policies and standards or adjustments to older
ones (especially for exotic data from social media, geospatial, or
surveillance systems).
New big data needs improvement, as do most data sets. Data
governance is more than policies for compliance. A mature program
also establishes and enforces standards for data's quality, models,
architectures, semantics, and development methods. All data sets
have problems and opportunities that merit attention, whether old
or new, from the enterprise or beyond. Data standards help leverage
data's opportunities and remediate its problems. Don't just move data
during a data warehouse modernization; improve it as well.
Data exploration is a compliance accident waiting to happen. A
common goal for data warehouse modernization is to collect highly
detailed source data for data exploration and discovery, usually in
conjunction with analytics. Exploration is increasingly performed with
modern data sets, such as data lakes, data vaults, and enterprise
data hubs, whether on Hadoop or large MPP DBMS installations. To
avoid compliance and privacy violations, all these scenarios need
governance policies and the appropriate level of security, as explained
below.
Hadoop must be secure, just like other IT systems. Security in purely open source Hadoop is limited largely to authentication based on Kerberos. This is useful, but it's only one approach to security, whereas mature enterprise IT teams tend to prefer multiple approaches. For example, many IT organizations have standardized on role-based and directory-based approaches. Eventually, users will also demand single sign-on, encryption, and data masking.
Fortunately, additional security measures (and other enterprise-grade
functions) are available for Hadoop, typically from vendors that offer
Hadoop distributions. These functions make the distributions more
appealing to mature enterprises than does purely open source Hadoop.
Additional functionality is also available from software vendors in the
extended Hadoop ecosystem.
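To make one of those eventual demands concrete, here is a small Python sketch of data masking. It illustrates the technique only; the function names and salt value are invented for the example and are not the API of any Hadoop distribution or security product.

```python
import hashlib

def mask_value(value, salt="per-environment-secret"):
    """Deterministic pseudonymization: the same input always yields the
    same token, so analysts can still join and count on the masked
    column without ever seeing the raw value."""
    digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
    return digest[:12]

def redact_last4(account_number):
    """Partial masking: expose only the last four characters."""
    return "*" * (len(account_number) - 4) + account_number[-4:]
```

Because the masking is deterministic per environment, masked data remains useful for exploration and quality checks while keeping raw identifiers out of the hands of users who are not authorized to see them.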


NUMBER EIGHT

APPLY MODERN ECONOMIC CRITERIA TO SELECTING AND USING DATA PLATFORMS

The primary benefit of a modern multi-platform DW environment is to proactively manage each data set on the data platform that is the best technology fit for that data set and its associated workloads. When possible, however, users should also manage data on a platform that realizes a low total cost of ownership (TCO), a high return on investment (ROI), or both. The calculus of TCO and ROI is complicated and fraught with exceptions but worth considering if you need to innovate how you control costs in a modern data warehouse environment. Note that TCO goes beyond acquisition costs; an enterprise must consider costs in other areas, such as development, maintenance, support, and usage. Note also that ROI may be expressed in either hard dollars or soft benefits.
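The point that TCO exceeds acquisition cost can be expressed as simple arithmetic. The sketch below uses hypothetical figures, not vendor pricing; the cost categories mirror those discussed in this section (personnel, administration, and facilities) and show how recurring costs can erase a low acquisition price over a multi-year life.

```python
def total_cost_of_ownership(acquisition, annual_staff, annual_admin,
                            annual_facilities, years):
    """TCO = one-time acquisition cost plus recurring costs over the
    system's service life."""
    return acquisition + (annual_staff + annual_admin + annual_facilities) * years

def cost_per_terabyte(tco, terabytes):
    """Normalize TCO by managed data volume for cross-platform comparison."""
    return tco / terabytes

# Hypothetical three-year comparison: a cheap-to-acquire platform that needs
# expensive specialists versus a premium platform with lower staffing costs.
cheap = total_cost_of_ownership(200_000, 450_000, 100_000, 50_000, years=3)
premium = total_cost_of_ownership(1_200_000, 150_000, 80_000, 40_000, years=3)
# cheap == 2_000_000; premium == 2_010_000 -- acquisition price alone misleads.
```

Each organization would substitute its own cost categories and figures, but the structure of the calculation is the same: amortize everything, not just the purchase order.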

Hadoop's acquisition costs are quite low compared to other data platforms in the modern DW environment, giving Hadoop a low cost per terabyte. However, the total cost of owning Hadoop mounts over time to fund skilled personnel, system administration, and environmental costs (such as power, space, and cooling). In particular, Hadoop requires more advanced programming skills than peer systems do. For example, experienced Hadoop users have spoken at TDWI conferences about the high payroll costs required for data scientists, programmers, and other highly technical staff. Yet the same speakers point out that Hadoop's average TCO is still lower than that of a comparable MPP RDBMS configuration; they would know, because they have both, and the two complement each other, as described earlier in this report.

Here are a few considerations of TCO and ROI for prominent platform types in a modern data warehouse environment:

Purpose-built systems have their place, and their price. Regardless of what their vendor creators intended, TDWI most often finds DW appliances and columnar RDBMSs used for SQL-based analytics performed by a relatively small user base but with very large data volumes. Secondarily, TDWI finds these platforms supporting multi-terabyte data marts, which are usually the foundation for specific analytic applications. DW appliances and columnar RDBMSs have a lower price point than mature, multi-purpose RDBMSs, which is appropriate given their limited use cases, small user constituencies, and scaled-back relational functionality. DW appliances and columnar RDBMSs fulfill an important role within a multi-platform DW environment as effective but affordable platforms for analytic sandboxes and departmental analytic applications.

You get what you pay for. Mature brands of relational database management systems (RDBMSs) are premium products and therefore command premium price tags. However, the expense is worth it to get an RDBMS's rich and fully baked feature set for query optimization, SQL standards, indexing, workload management, in-memory processing, data compression, metadata management, large concurrent user bases, view technologies (materialized, federated, virtual, dimensional), and a variety of other system management and end-user productivity features. These features are required for demanding data-driven practices, such as data warehousing, reporting, business performance management, operational BI, and OLAP. The data managed for these practices is high value (and hence merits financial investment) because it's used by employees who make strategic and operational decisions that deeply influence the success of the enterprise. For these reasons, the vast majority of DWs today are built on mature RDBMSs, and, due to the value returned, these organizations have little trouble justifying the cost.
You can pay now or pay later. Hadoop is based on open source software that runs well on commodity-priced hardware. Hence, as noted above, its acquisition costs are low, but its total cost of ownership mounts over time in skilled personnel, administration, and facilities.

If we pull together the three prominent platform types discussed above, they provide a diverse range of options for both functionality and cost. This report discussed the established pairing of RDBMS-based warehouses with Hadoop. TDWI also sees appliances and columnar databases tightly integrated with the relational warehouse and Hadoop, as illustrated in Figure 3.

[Figure 3 arranges the three platform types along a spectrum: at one end, mature, feature-rich relational DBMSs deliver premium functionality at a premium price; in the middle, appliances and columnar DBs are built for DW and analytics; at the other end, open source Hadoop, built for big data, offers emerging functionality at a low entry price.]

Figure 3. Range of platform options within a data warehouse environment.



Cost and functionality are major drivers for data migration. Again, the point of the multi-platform DW environment is to manage a data set on a platform that is the best fit for it and its workloads. That's a technology consideration; yet many users are under pressure to control costs, so they weigh both cost and functionality when they choose a platform and the physical placement of data. This balance of cost and functionality is driving certain kinds of data migrations, usually in the context of data warehouse modernization.

For example, many users complement their core data warehouse with standalone implementations of columnar and appliance-based relational databases. This migration frees up capacity on the DW and provides a workload-specific platform optimized for complex, SQL-based analytics. More recently, data migrations to Hadoop have increased, as early adopters offload the core warehouse and take advantage of Hadoop's inexpensive storage, scalability, and analytic processing power. In other words, in the model shown in Figure 3, there's a trend to migrate data from left to right. Despite the migration of a minority of warehouse data sets, the relational DW is as relevant as ever, and it has a new and more practical focus on data sets that truly belong on it.

Data migrations aside, users also balance cost and functionality in green-field situations, as when they select platforms for new data (such as that from machines and social media). A similar balance is struck in common data warehouse modernization tasks, especially the consolidation of proliferated data marts and ODSs.

Everyone's different. Data warehouse modernization is a golden opportunity for rethinking both TCO and ROI, at both the single-platform and total-environment levels. Each organization has its own unique mix of business, technology, and budgetary requirements. Each organization will need to develop its own metrics for quantifying platform TCO and ROI, the value of specific data sets, and the value of certain user constituencies. These financial metrics can complement technical considerations, such as the size and usage patterns of data, so that platform acquisition and usage decisions are fully informed and innovative.

This report has discussed the leading options for data warehouse modernization today, as well as future directions for modernization. Most modernization efforts should consider all those options but give priority to what the business needs from data, while leaving room for innovation based on new data, new technologies, new architectures, and new opportunities for managing costs in the modern data warehouse environment.


ABOUT OUR SPONSORS

www.cloudera.com

www.impetus.com

Cloudera is revolutionizing enterprise data management by offering the first unified Platform for Big Data, an enterprise data hub built on
Apache Hadoop. Cloudera offers enterprises one place to store, access,
process, secure, and analyze all their data, empowering them to extend
the value of existing investments while enabling fundamental new
ways to derive value from their data. Clouderas open source Big Data
platform is the most widely adopted in the world, and Cloudera is the
most prolific contributor to the open source Hadoop ecosystem. As the
leading educator of Hadoop professionals, Cloudera has trained over
30,000 individuals worldwide. Over 1,450 partners and a seasoned
professional services team help deliver greater time to value. Finally,
only Cloudera provides proactive and predictive support to run an
enterprise data hub with confidence. Leading organizations in every
industry plus top public sector organizations globally run Cloudera in
production. For more information, visit us at www.cloudera.com.

Impetus Technologies provides innovative big data solutions and services that empower large enterprises to unlock the full value of
their big data opportunities. Our proven methodologies and solutions
span the full life cycle of architecture advisory, proof of value, data
science, application development, and implementation services. We
have launched solutions for data warehouse modernization and real-time streaming data analytics. The data warehouse modernization
solution incorporates a proven productive methodology with an
automation toolset that substantially reduces the time and cost
of migrating ETL and analytics functions to Hadoop. We have also
introduced StreamAnalytix, an application development platform for
rapid development of real-time streaming analytics applications.
Both solutions leverage our deep experience as an early adopter of
big data technologies and offer enterprise-class support and ease
of use combined with the benefits of open source. By leveraging the
open source community, we are able to incorporate a vibrant source
of innovation while reducing cost for our enterprise customers. More
information regarding our big data services can be found at
bigdata.impetus.com and www.streamanalytix.com, or by writing to us
at bigdata@impetus.com.

www.mapr.com
MapR Technologies delivers on the promise of Hadoop with a proven
enterprise-grade platform that supports a broad set of mission-critical and real-time production uses. MapR brings unprecedented
dependability, ease of use, and world-record speed to Hadoop, NoSQL
data stores, and streaming applications in one unified distribution for
Hadoop. MapR is used by more than 500 customers across financial
services, government, healthcare, manufacturing, media, retail, and
telecommunications sectors as well as by leading Global 2000 and Web
2.0 companies.
MapR provides engineering contributions to several open source Apache
Hadoop projects including Apache Drill. Drill delivers interactive ANSI
SQL queries on Hadoop and NoSQL databases, without requiring the
building of centralized schemas. Drill is the first on-the-fly schema-discovery SQL engine that brings instant insight from any data source, from simple files to complex hierarchical JSON data structures and
schema-less databases. You can get started with Drill in minutes by
downloading the MapR Sandbox for Drill.

www.teradata.com
The Teradata Unified Data Architecture (UDA) enables companies to get
more value from their data by connecting the dots across the business
for breakthrough insights and providing the agility to answer new
business questionsall while reducing overall costs and complexity.
The UDA is a proven, reliable, and cost-effective framework for
integrating analytics across Hadoop and the data warehouse.
As the market leader in data warehousing, Teradata has deep engineering relationships with Hortonworks, Cloudera, and MapR that provide customers with the choice to implement the best distribution for their needs. Hadoop and the Integrated Data Warehouse are orchestrated with products such as QueryGrid, which through a single query pushes down analytics to where the data resides across the ecosystem, thereby reducing data movement and redundancy.


ABOUT THE AUTHOR


Philip Russom is director of TDWI Research for data management
and oversees many of TDWI's research-oriented publications,
services, and events. He is a well-known figure in data warehousing
and business intelligence, having published over 500 research
reports, magazine articles, opinion columns, speeches, Webinars,
and more. Before joining TDWI in 2005, Russom was an industry
analyst covering BI at Forrester Research and Giga Information
Group. He also ran his own business as an independent industry
analyst and BI consultant and was a contributing editor with
leading IT magazines. Before that, Russom worked in technical and
marketing positions for various database vendors. You can reach
him at prussom@tdwi.org, @prussom on Twitter, and on LinkedIn at
linkedin.com/in/philiprussom.

ABOUT TDWI RESEARCH


TDWI Research provides research and advice for data professionals
worldwide. TDWI Research focuses exclusively on business
intelligence, data warehousing, and analytics issues and teams
up with industry thought leaders and practitioners to deliver both
broad and deep understanding of the business and technical
challenges surrounding the deployment and use of business
intelligence, data warehousing, and analytics solutions. TDWI
Research offers in-depth research reports, commentary, inquiry
services, and topical conferences as well as strategic planning
services to user and vendor organizations.

ABOUT TDWI CHECKLIST REPORTS


TDWI Checklist Reports provide an overview of success factors for
a specific project in business intelligence, data warehousing, or
a related data management discipline. Companies may use this
overview to get organized before beginning a project or to identify
goals and areas of improvement for current projects.

