Will the Data Lake Drown the Data Warehouse?
Mark Madsen
Third Nature Inc.
2W Fifth Avenue, Fourth Floor, San Mateo, CA 94402
telephone: 888.494.1570
www.SnapLogic.com
Table of Contents
Executive Summary
Data Acquisition
Processing and Management
Access and Delivery
What Do You Need to Build Today?
Conclusion
Executive Summary
New opportunities in business require a new platform to process data. The data warehouse has been used
to support many different query and reporting needs, but organizations want a general-purpose, multi-application, multi-user platform that supports needs other than just query and reporting: the data lake.
To date, most lake deployments have been built through manual coding and custom integration. Most of this
development effort is only the first stage of work; once it is done, the useful work of building business
applications can start.
Manual coding of data processing applications is common because data processing is thought of in terms of
application-specific work. Unfortunately, this manual effort is a dead-end investment over the long term
because products will take over the repeatable tasks. The new products will improve over time, unlike the
custom code built in an enterprise that becomes a maintenance burden as it ages.
This puts technology managers in a difficult position today. The older data warehouse environments and
integration tools are good at what they do, but they can't meet many of the new needs. The new
environments are focused on data processing, but require a lot of manual work. Should one buy, build or
integrate components? What should one buy or build?
The answer is to focus not on specific technologies like Hadoop but on the architecture. In particular,
one should focus on how to provide the core new capability of a data lake: general-purpose data processing.
This paper describes the needs driving the data lake, why a data warehouse isn't up to the new tasks and
the architectural concepts to support the processing needs faced in the new environment.
Among the requirements a data lake must meet: store data for any purpose, from immediate use to archival, and refine and deliver data as part of operational processes, from batch to near real time.
Most enterprise data integration tools were built assuming use of a relational database. This works well for
data coming from transactional applications. It works less well for logs, event streams and human-authored
data. These do not have the same regular structure of rows, columns and tables that databases and
integration tools require. These tools have difficulty working with JSON and must do extra work to process
and store it.
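To make the mismatch concrete, here is a rough sketch in Python of the extra work a single nested event can require; the event and its field names are invented for illustration.

```python
import json

# A clickstream event as it typically arrives: nested objects and a
# variable-length array, with no fixed row-and-column shape.
event = json.loads("""
{
  "user": {"id": 42, "geo": {"country": "US", "city": "San Mateo"}},
  "session": "a1b2c3",
  "clicks": [
    {"page": "/home", "ts": "2015-06-01T12:00:00Z"},
    {"page": "/pricing", "ts": "2015-06-01T12:00:31Z"}
  ]
}
""")

# To land this in a relational table, an integration tool has to flatten
# the nesting and explode the array into one row per click.
rows = [
    {
        "user_id": event["user"]["id"],
        "country": event["user"]["geo"]["country"],
        "session": event["session"],
        "page": click["page"],
        "ts": click["ts"],
    }
    for click in event["clicks"]
]

for row in rows:
    print(row)
```

Every added level of nesting or optional field multiplies this kind of flattening logic, which is exactly the extra work the older tools must do before they can store such data.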
The reverse is not true. Newer data integration environments can handle regular, tabular data as easily as JSON, and they are less brittle than the older
data integration tools. One simple field change upstream can break a dataflow in the older tools, where the
more flexible new environment may be able to continue uninterrupted.
JSON is not the best format for storing data, however. This means tools are needed to translate data from
JSON to more efficient storage formats in Hadoop, and from those formats back to JSON for applications.
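A minimal sketch of that translation, assuming Spark is available on the lake; the HDFS paths are illustrative.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("json-to-parquet").getOrCreate()

# Land raw JSON events as they arrived, then rewrite them in a columnar
# format (Parquet here) that is far cheaper to store and scan.
events = spark.read.json("hdfs:///lake/raw/events/")
events.write.mode("overwrite").parquet("hdfs:///lake/standardized/events/")

# The reverse direction: read the efficient format and hand JSON back
# to an application that expects it.
standardized = spark.read.parquet("hdfs:///lake/standardized/events/")
standardized.write.mode("overwrite").json("hdfs:///exports/events_json/")
```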
Much of the web and non-transactional data is sent today as JSON messages. The more flexible Hadoop
and streaming technologies are a better match for transporting and processing this data than conventional
data integration tools.
Many of these forms can be represented as tables in a database, but they are not accessed or processed
the same way a relational database processes queries against tables. These forms are one of the reasons we
have new non-relational engines, for example graph processing engines, time series stores and general-purpose execution engines such as Spark.
Datasets composed of discrete objects, like image collections, scanned documents or free text are a poor
match for the database environment. A data lake must support the storage and processing of these as easily
as it supports the more structured forms of graphs and tables.
These requirements compel rethinking of basic assumptions about data architecture and system design, assumptions that have been present for two decades. It is no longer sufficient to use only a relational database and an ETL tool, nor is there a single unified data model for all data. Instead there are many data models, storage formats and processing engines, each suited to different data and workloads.
The data lake architecture (its functional components) is shown in figure 1. The most important thing to note about diagrams like this is that
the hard work is hidden in the arrows, not the boxes.
Similar to a data warehouse, the data lake has varied sources of data. Unlike in a data warehouse, that data
need not be cleaned during acquisition, and it can arrive at any frequency, from real time to hours or
days. The lake can persist the data regardless of its native format and arrival rate.
The data lake concept is not just a place to put data. It's a place to work on the data too. Processing in a
lake is unlike that done for a data warehouse. The lake platform handles both storage of data and its processing,
whereas a data warehouse separates these into extract, transform and load (ETL) tools and a database.
Because of this separation into layers, all the data must be conformed to tables and integrated before it can
be loaded.
A lake can retain data in its native format but can also standardize, clean, aggregate and then store data in
more consumable and conformed formats. This is a change from a data warehouse architecture. The same
information may be managed in raw, standardized and purpose-built formats on the same platform, at the
same time. Data warehouses have just one place for user-accessible data.
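As a sketch of how the same information can be held in raw, standardized and purpose-built form on one platform, assuming Spark on the lake; the paths and field names are invented for the example.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("lake-zones").getOrCreate()

# Raw zone: the data exactly as it arrived.
raw = spark.read.json("hdfs:///lake/raw/orders/")

# Standardized zone: the same data, deduplicated, typed and cleaned,
# stored alongside the raw copy rather than replacing it.
standardized = (
    raw.dropDuplicates(["order_id"])
       .withColumn("order_ts", F.to_timestamp("order_ts"))
       .filter(F.col("amount").isNotNull())
)
standardized.write.mode("overwrite").parquet("hdfs:///lake/standardized/orders/")

# Purpose-built zone: an aggregate shaped for one consuming application.
daily_revenue = (
    standardized.groupBy(F.to_date("order_ts").alias("day"))
                .agg(F.sum("amount").alias("revenue"))
)
daily_revenue.write.mode("overwrite").parquet("hdfs:///lake/consumption/daily_revenue/")
```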
Another difference is access to data. A
data warehouse is the final resting place
for data. A data lake may be the
permanent location where data is used,
or it may support a data refinement
process whose endpoint is another
application. That application might be
built on the lake platform or it might be a
remote application that uses the data
independently. This adds data
management and integration elements to a
data lake that did not exist in the traditional
environment.
Figure 1: Data lake components. The acquisition component allows any data to
be collected at any latency. The management component allows some data to be
standardized and integrated. The access component provides access at any
latency and via any means an application or user needs. Processing can be done
to any data at any time from any area.
A data warehouse and legacy data integration tools can meet some of these needs, but not in a consistent
and unified fashion, because these needs were not the primary focus of their design. A data lake dictates
new requirements that the existing tools and environments can't easily support.
The remainder of this paper will focus on the core capability of a data lake that differentiates it from earlier
platforms: general-purpose data processing.
Data Acquisition
Acquiring data involves more than pointing systems that produce data at Hadoop. The history of Hadoop
has mainly been about data in streams or logs that are collected and stored, then processed. External
connectivity is largely missing, except within the new ecosystem of streams, logs and APIs. The general
assumption is that all data comes to the Hadoop environment where it resides. This simplifying assumption
works well when building a single Hadoop application, or when the organization is primarily an online digital
business.
The challenge comes when data is not in the new ecosystem but in existing enterprise systems, whether
they be SAP applications, custom Oracle systems or software-as-a-service (SaaS) applications like Workday
and Salesforce. For example, behavioral events and web clicks are useful, but core financial transactions and
product reference data are equally important if one wants to link behaviors to outcomes.
Enterprise data exists outside the big data ecosystem and must be fetched via different connectors and
mechanisms. The problem becomes one of data extraction, an area where the Hadoop ecosystem lacks
connectivity and tool support.
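The gap is usually closed by reaching out to those systems directly. A sketch of one common approach, pulling a reference table over JDBC with Spark, with placeholder connection details and assuming the matching JDBC driver is on the cluster's classpath.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("enterprise-extract").getOrCreate()

# Pull a product reference table from an existing relational system and
# land it in the lake next to the event data. URL, credentials, table
# name and target path are placeholders.
products = (
    spark.read.format("jdbc")
         .option("url", "jdbc:oracle:thin:@//erp-host:1521/ORCL")
         .option("dbtable", "PRODUCT_MASTER")
         .option("user", "etl_user")
         .option("password", "change_me")
         .option("fetchsize", "10000")
         .load()
)
products.write.mode("overwrite").parquet("hdfs:///lake/raw/product_master/")
```

Doing this for dozens of SAP, Oracle or SaaS sources means building and maintaining a connector for each one, which is exactly the kind of extraction work the older integration tools were built around.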
The data warehouse and data integration markets assume that data must be pulled from other systems. The
primary purpose of data processing in these environments is to extract data, integrate and clean it, then put
it into a database. These tools offer only a partial solution to the requirements of a data lake: they operate on the assumption that the destination is a relational database.
Building a single dataflow that extracts data from source systems and links it with previously collected data is a challenge. Orchestrating many such dataflows, each
with different needs, dependencies and SLAs, is more challenging still. Writing code to process data is only
one piece of a larger problem, and all these extra pieces make manual coding hard to sustain.
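To show what sits around the processing code, here is a sketch of one such dataflow expressed as a dependency graph in a workflow scheduler, Apache Airflow in this case; the tasks, scripts and schedule are purely illustrative.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# A hypothetical nightly dataflow: extract from two source systems,
# link the results with previously collected data, then publish.
with DAG(
    dag_id="nightly_order_pipeline",
    start_date=datetime(2015, 6, 1),
    schedule_interval="0 2 * * *",   # run at 02:00 to meet a morning SLA
    catchup=False,
) as dag:
    extract_erp = BashOperator(task_id="extract_erp",
                               bash_command="spark-submit extract_erp.py")
    extract_crm = BashOperator(task_id="extract_crm",
                               bash_command="spark-submit extract_crm.py")
    link_history = BashOperator(task_id="link_with_history",
                                bash_command="spark-submit link_history.py")
    publish = BashOperator(task_id="publish_to_warehouse",
                           bash_command="spark-submit publish.py")

    # Both extracts must finish before linking; publishing comes last.
    [extract_erp, extract_crm] >> link_history >> publish
```

Even this small example carries scheduling, dependency and failure-handling concerns that have nothing to do with the transformation logic itself.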
The best way to see the challenge faced when building a data lake is to focus on integration in the Hadoop
environment. A common starting point is the idea of moving ETL and data processing from traditional tools
to Hadoop, then pushing the data from Hadoop to a data warehouse or database like Amazon Redshift so
that users can still work with data in a familiar way.
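A simplified sketch of that offload pattern: parse raw web logs with Spark and land the results in a columnar format that a warehouse loader or an on-cluster query engine can consume. The paths are illustrative and the code assumes Apache-style access log lines.

```python
import re

from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.appName("weblog-offload").getOrCreate()

# One line of an Apache-style access log: ip, timestamp, request, status, size.
LOG_PATTERN = re.compile(
    r'(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+) \S+" (\d{3}) (\S+)'
)

def parse_line(line):
    m = LOG_PATTERN.match(line)
    if m is None:
        return None          # malformed lines are dropped below
    ip, ts, method, url, status, size = m.groups()
    return Row(ip=ip, ts=ts, method=method, url=url,
               status=int(status), bytes=0 if size == "-" else int(size))

lines = spark.sparkContext.textFile("hdfs:///lake/raw/weblogs/")
events = spark.createDataFrame(lines.map(parse_line).filter(lambda r: r is not None))

# Store the parsed events where a warehouse load job or query engine
# can pick them up.
events.write.mode("overwrite").parquet("hdfs:///lake/standardized/weblogs/")
```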
If we look at some of the specifics in this scenario, the problem of using a bill of materials as a technology
guide becomes apparent. For example, the processing of web event logs is unwieldy and expensive in a
database, so many companies shift this workload to Hadoop. Table 1 lists the components that could be
used to offload this work into Hadoop.
As with any set of components, there are tradeoffs in the choices. By choosing a specific language, the
choice of engine may be limited. For example, at the time of this writing Pig works with Tez and MapReduce
but not Spark. Choosing a specific engine may constrain the language, forcing developers to use an
unfamiliar or lower-level language. The tradeoffs extend to data storage: a file format that is optimal for storing the desired information may not be optimal for processing performance.
Table 1: Log processing requirements and the components needed to meet them, among them collecting web log data and storing it for processing, loading a data warehouse, monitoring individual jobs and the collection of jobs, and managing metadata (for example, with Cloudera Navigator).
Conclusion
When you set out to build a data lake you are really creating a data platform on which other applications
depend. A data lake should last for years without significant redesign that impacts the applications it
supports. This is one of the capabilities the data warehouse got right.
To build this requires a new architecture and tools that enable end-to-end data processing. The data lake
requires tools similar to those the warehouse environment provided: tools to hide the underlying platform complexity,
orchestrate the components, and allow the developer to focus on the tasks that are important to building
applications.
Figure 2: SnapLogic accelerates the development of an enterprise data lake by automating data acquisition, transformation and access.
Data access: organizing and preparing data for delivery and visualization. Make data processed on Hadoop
or Spark easily available to off-cluster applications and data stores such as statistical packages and
business intelligence tools.
SnapLogic's platform-agnostic approach decouples data processing specification from execution. As data
volume or latency requirements change, the same pipeline can be used just by changing the target data
platform. SnapLogic's SnapReduce enables SnapLogic to run natively on Hadoop as a YARN-managed
resource that elastically scales out to power big data analytics, while the Spark Snap helps users create
Spark-based data pipelines ideally suited for memory-intensive, iterative processes. Whether MapReduce,
Spark or another big data processing framework, SnapLogic allows customers to adapt to evolving data lake
requirements without locking into a specific framework.
Key Takeaways
One way to escape the spiral of increasing complexity and maintenance costs with Hadoop is to develop not an
application, but a data lake that integrates repeatable, reusable components and isolates the highly changeable
work done by applications. This is a shift from the idea of Hadoop as a database to the idea of Hadoop as the
engine inside a data lake.
Much of the web and non-transactional data is sent today as JSON messages. The more flexible infrastructure of
a data lake can support both streaming and batch. This makes it a better match for transporting and processing
data than databases and conventional data integration tools.
Hadoop is to data lake as database is to data warehouse. Hadoop is one piece of the bigger system that is a
data lake. It requires other components that are not - at least today - part of the base platform.
Supporting Hadoop or a data lake in an enterprise setting is a challenge today because of the many components
that must be integrated. Higher-order tools are required to escape this complexity.
The technology market is shifting away from manual integration and custom coding for each Hadoop application
to higher-level tools, because the goal isn't to write support code; it is to build valuable
applications.
About SnapLogic
SnapLogic is the industry's first unified data and application integration platform as a service (iPaaS). The
SnapLogic Elastic Integration Platform enables enterprises to connect to any source, at any speed, anywhere,
whether on premises, in the cloud or in hybrid environments. The easy-to-use platform empowers self-service integrators, eliminates information silos, and provides a smooth onramp to big data. Founded by data
industry veteran Gaurav Dhillon and backed by leading venture investors, including Andreessen Horowitz and
Ignition Partners, SnapLogic is helping companies across the Global 2000 to connect faster. Learn more
about SnapLogic for big data integration at www.SnapLogic.com/bigdata.