Co-sponsored by:
tdwi.org
MAY 2015
TABLE OF CONTENTS
FOREWORD
NUMBER ONE
Modernize your data warehouse environment to leverage new data and big data
NUMBER TWO
Support the data needs of new analytics with a modern warehouse and other integrated data platforms
NUMBER THREE
Re-architect the data warehouse and its environment as you modernize
NUMBER FOUR
Consider Hadoop an extension of the modern warehouse
NUMBER FIVE
Modernize ETL, not just the core warehouse
NUMBER SIX
Accelerate the business closer to real-time operations as you modernize the data warehouse and related systems
NUMBER SEVEN
Comply with external regulations and internal policies as you handle data during modernization
NUMBER EIGHT
Apply modern economic criteria to selecting and using data platforms
ABOUT OUR SPONSORS
ABOUT THE AUTHOR
ABOUT TDWI RESEARCH
ABOUT TDWI CHECKLIST REPORTS
555 S Renton Village Place, Ste. 700
Renton, WA 98057-3295
T 425.277.9126
F 425.687.2842
E info@tdwi.org
tdwi.org
TDWI RESEARCH
© 2015 by TDWI, a division of 1105 Media, Inc. All rights reserved. Reproductions in whole or in
part are prohibited except by written permission. E-mail requests or feedback to info@tdwi.org.
Product and company names mentioned herein may be trademarks and/or registered trademarks of
their respective companies.
FOREWORD
must adapt to a wider range of data types, including schema-free and evolving ones.
4. Open source software (OSS) is now ensconced in data
warehousing. Ten years ago, Linux was the only OSS product
commonly found in the technology stack for DWs, BI, analytics,
and data management. Today, TDWI regularly encounters OSS
products for reporting, analytics, data integration, and big data
management. This is because OSS has reached a new level of
functional maturity while still being economically desirable.
A growing number of user organizations are eager to leverage
both characteristics.
To help user organizations prepare, this TDWI Checklist Report
canvasses eight of the leading DW modernization scenarios,
discussing many of the new product types, functionality, and
user best practices (as well as the business case and technology
strengths) of each.
NUMBER ONE
[Figure 1. Response labels: "No strategy, though we need one"; "No strategy because we don't need one"; "Other". Values shown: 41%, 25%, 23%, 6%, 5%.]
Figure 1 in this report is based on Figure 11 in the 2014 TDWI Best Practices Report Evolving Data Warehouse Architectures in the Age of Big Data, available for download at tdwi.org.
NUMBER TWO
NUMBER THREE
[Figure 2. Response labels: "Central EDW with many additional data platforms"; "No true EDW; many workload-specific data platforms instead"; "Other"; "Central monolithic EDW with no other data platforms". Other labels: DWE, EDW. Values shown: 15%, 37%, 16%, 15%, 15%, 2%.]
Figure 2 in this report is based on Figure 10 in the 2014 TDWI Best Practices Report Evolving Data Warehouse Architectures in the Age of Big Data, available for download at tdwi.org.
NUMBER FOUR
Data staging. Hadoop was designed for data management best
practices that ingest data early and process it later. Hence, it adapts
well to data landing, data staging, and the transformational
processing of data that usually accompanies such practices.
Source data archiving. It's impossible to foresee all the ways
that source data will need to be repurposed for new analytic
applications in the future. The current practice is to retain raw,
extracted data with all its original details. Much of the expensive
storage capacity of EDWs is burned up by large archives of source
data; Hadoop can store and process this data just as well, but at a
fraction of the cost. Unlike old-fashioned archives that depend on
offline media such as magnetic tapes and optical disks, a
Hadoop-based archive is online, queryable, and searchable, so users get
daily business value from it without time-consuming data-restore
processes.
Computational analytics. Valuable computational analytics
performed by Hadoop users today includes website behavior
analysis, sentiment analysis, clustering for customer base
segments, and many applications of statistical or mining
techniques with large volumes of diverse data.
ETL/ELT offload. Just as users offload data and analytic
workloads from the core DW to Hadoop, they also offload jobs for
extract, transform, and load (ETL). The catch is that some ETL or
ELT jobs are inherently relational or set-based because they involve
complex table joins or depend on advanced SQL functions; such
jobs are best controlled by a data integration tool and pushed
down into a relational DBMS. However, other ETL jobs count entity
occurrences or perform algorithmic processing on a massive
scale, which is at the core of Hadoop's design.
Operational data stores (ODSs). TDWI has found users who have
migrated ODSs from relational DBMSs to Hadoop, typically for use
with Hive and HBase, sometimes MapReduce and Pig. They report
that the straightforward record or relational data structures of
an ODS migrate easily and perform well with little tweaking once
in Hadoop. In a similar trend, some users are working toward
an enterprise data hub (EDH), which extends the capabilities of
operational data stores, to bring more analytic workloads to larger
volumes of diverse data.
Readers unfamiliar with Hadoop may wish to read the TDWI Best Practices Reports Integrating Hadoop into Business Intelligence and Data Warehousing and Hadoop for the Enterprise, available for
download at tdwi.org.
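The ETL/ELT offload item above notes that jobs that count entity occurrences on a massive scale sit at the core of Hadoop's design. As a rough illustration only, the following single-process Python sketch mimics the map, shuffle, and reduce phases of such a counting job. The record layout, the `customer_id` field, and all function names are assumptions made for this example; the code does not use Hadoop or any Hadoop API.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(records: Iterable[dict], key_field: str) -> Iterator[Tuple[str, int]]:
    """Emit an (entity, 1) pair for every record that carries the key field."""
    for record in records:
        if key_field in record:
            yield record[key_field], 1


def shuffle_phase(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group emitted values by key, as a framework would between map and reduce."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped: Dict[str, List[int]]) -> Dict[str, int]:
    """Sum the grouped values to get one occurrence count per entity."""
    return {key: sum(values) for key, values in grouped.items()}


def count_entities(records: Iterable[dict], key_field: str) -> Dict[str, int]:
    """Run the three phases in sequence on an in-memory collection."""
    return reduce_phase(shuffle_phase(map_phase(records, key_field)))


if __name__ == "__main__":
    # Hypothetical clickstream records; the field names are illustrative only.
    clicks = [
        {"customer_id": "a", "page": "/home"},
        {"customer_id": "b", "page": "/cart"},
        {"customer_id": "a", "page": "/checkout"},
    ]
    print(count_entities(clicks, "customer_id"))
```

On a real cluster the map and reduce steps would run in parallel across many nodes and the framework would handle the grouping; the sketch only shows why counting jobs of this shape split cleanly into independent pieces.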
NUMBER FIVE
NUMBER SIX
ACCELERATE THE BUSINESS CLOSER TO REAL-TIME OPERATIONS AS YOU MODERNIZE THE DATA
WAREHOUSE AND RELATED SYSTEMS
For an in-depth examination of real-time operations, see the 2014 TDWI Best Practices Report Real-Time Data, BI, and Analytics, available on tdwi.org.
NUMBER SEVEN
NUMBER EIGHT
You get what you pay for. Mature brands of relational database
management systems (RDBMSs) are premium products and
therefore command premium price tags. However, the expense is
worth it to get an RDBMS's rich variety and fully baked feature
sets for query optimization, SQL standards, indexing, workload
management, in-memory processing, data compression, metadata
management, large concurrent user bases, view technologies
(materialized, federated, virtual, dimensional), and a variety of
other system management and end-user productivity features.
These features are required for demanding data-driven practices,
such as data warehousing, reporting, business performance
management, operational BI, and OLAP. The data managed
for these practices is high value (and hence merits financial
investment) because it's used by employees who make strategic
and operational decisions that deeply influence the success of the
enterprise. For these reasons, the vast majority of DWs today are
built on mature RDBMSs, and, due to the value returned, these
organizations have little trouble justifying the cost.
You can pay now or pay later. Hadoop is based on open source
software that runs well on commodity-priced hardware. Hence,
[Figure: a range of data platform options, spanning "Premium Functionality at a Premium Price" (Mature, Feature-Rich Relational DBMSs) to "Emerging Functionality at a Low Entry Price".]
Cost and functionality are major drivers for data migration.
Again, the point of the multi-platform DW environment is to
manage a data set on a platform that is the best fit for it and
its workloads. That's a technology consideration; yet, many users
are under pressure to control costs, so they look at both cost and
functionality considerations when they choose their platform and
physical placement of data. The balance of cost and functionality
is driving certain kinds of data migrations, usually in the context of
data warehouse modernization.
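One way to picture the trade-off described above is as a simple selection rule: among the platforms that offer the functionality a data set's workloads require, prefer the least expensive one. The sketch below is illustrative only. The platform names, feature labels, and cost figures are hypothetical placeholders chosen for the example, not data from this report.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional


@dataclass(frozen=True)
class Platform:
    name: str
    features: FrozenSet[str]
    cost_per_tb: float  # hypothetical relative cost units, not real prices


def choose_platform(
    required: FrozenSet[str], platforms: List[Platform]
) -> Optional[Platform]:
    """Return the cheapest platform that covers every required feature."""
    candidates = [p for p in platforms if required <= p.features]
    if not candidates:
        return None  # no platform fits; the requirement needs a rethink
    return min(candidates, key=lambda p: p.cost_per_tb)


# Hypothetical catalog for illustration only.
PLATFORMS = [
    Platform(
        "rdbms",
        frozenset({"storage", "sql", "workload_management", "concurrency"}),
        10.0,
    ),
    Platform("hadoop", frozenset({"storage", "batch_processing"}), 1.0),
]

if __name__ == "__main__":
    archive_needs = frozenset({"storage"})
    reporting_needs = frozenset({"sql", "concurrency"})
    print(choose_platform(archive_needs, PLATFORMS).name)
    print(choose_platform(reporting_needs, PLATFORMS).name)
```

With this hypothetical catalog, a data set that only needs storage lands on the low-cost platform, while one that needs SQL and high concurrency lands on the relational DBMS. A real placement decision would weigh many more factors than a feature checklist and a single cost number.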
This report has discussed the leading options for data warehouse
modernization today, as well as future directions for modernization.
Most modernization efforts should consider all those options but give
priority to what the business needs from data, while leaving room for
innovation based on new data, new technologies, new architectures,
and new opportunities for managing costs in the modern data
warehouse environment.
www.cloudera.com
www.impetus.com
www.mapr.com
MapR Technologies delivers on the promise of Hadoop with a proven
enterprise-grade platform that supports a broad set of mission-critical and real-time production uses. MapR brings unprecedented
dependability, ease of use, and world-record speed to Hadoop, NoSQL
data stores, and streaming applications in one unified distribution for
Hadoop. MapR is used by more than 500 customers across financial
services, government, healthcare, manufacturing, media, retail, and
telecommunications sectors as well as by leading Global 2000 and Web
2.0 companies.
MapR provides engineering contributions to several open source Apache
Hadoop projects including Apache Drill. Drill delivers interactive ANSI
SQL queries on Hadoop and NoSQL databases, without requiring the
building of centralized schemas. Drill is the first on-the-fly schema-discovery SQL engine that brings instant insight from any data source
from simple files to complex hierarchical JSON data structures and
schema-less databases. You can get started with Drill in minutes by
downloading the MapR Sandbox for Drill.
www.teradata.com
The Teradata Unified Data Architecture (UDA) enables companies to get
more value from their data by connecting the dots across the business
for breakthrough insights and providing the agility to answer new
business questions, all while reducing overall costs and complexity.
The UDA is a proven, reliable, and cost-effective framework for
integrating analytics across Hadoop and the data warehouse.
As the market leader in data warehousing, Teradata has deep
engineering relationships with Hortonworks, Cloudera, and MapR that
provide customers with the choice to implement the best distribution
for their needs. Hadoop and the Integrated Data Warehouse are
orchestrated with products such as QueryGrid, which through a single
query pushes analytics down to where the data resides across the
ecosystem, thereby reducing data movement and redundancies.