Will the Data Lake Drown the Data Warehouse?
Mark Madsen
Third Nature Inc.
2W Fifth Avenue, Fourth Floor, San Mateo, CA 94402
telephone: 888.494.1570
www.SnapLogic.com
Table of Contents
Executive Summary
Data Acquisition
Processing and Management
Access and Delivery
What Do You Need to Build Today?
Conclusion
Executive Summary
New opportunities in business require a new platform to process data. The data warehouse has been used
to support many different query and reporting needs, but organizations want a general-purpose, multi-application, multi-user platform that supports needs other than just query and reporting: the data lake.
To date, most lake deployments have been built through manual coding and custom integration. Most of this
development effort is only the first stage of work; once it is done, the useful work of building business
applications can start.
Manual coding of data processing applications is common because data processing is thought of in terms of
application-specific work. Unfortunately, this manual effort is a dead-end investment over the long term
because products will take over the repeatable tasks. The new products will improve over time, unlike the
custom code built in an enterprise that becomes a maintenance burden as it ages.
This puts technology managers in a difficult position today. The older data warehouse environments and
integration tools are good at what they do, but they can't meet many of the new needs. The new
environments are focused on data processing, but require a lot of manual work. Should one buy, build or
integrate components? What should one buy or build?
The answer is to focus not on specific technologies like Hadoop but on the architecture. In particular,
one should focus on how to provide the core new capability of a data lake: general-purpose data processing.
This paper describes the needs driving the data lake, why a data warehouse isn't up to the new tasks and
the architectural concepts to support the processing needs faced in the new environment.
Among the requirements a data lake must meet: store data for any purpose, from immediate use to archival, and refine and deliver data as part of operational processes, from batch to near real time.
Most enterprise data integration tools were built assuming use of a relational database. This works well for
data coming from transactional applications. It works less well for logs, event streams and human-authored
data. These do not have the same regular structure of rows, columns and tables that databases and
integration tools require. These tools have difficulty working with JSON and must do extra work to process
and store it.
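To make the mismatch concrete, here is a rough sketch in Python of the extra work a single nested event can require; the event and its field names are invented for illustration.

```python
import json

# A clickstream event as it typically arrives: nested objects and a
# variable-length array, with no fixed row-and-column shape.
event = json.loads("""
{
  "user": {"id": 42, "geo": {"country": "US", "city": "San Mateo"}},
  "session": "a1b2c3",
  "clicks": [
    {"page": "/home", "ts": "2015-06-01T12:00:00Z"},
    {"page": "/pricing", "ts": "2015-06-01T12:00:31Z"}
  ]
}
""")

# To land this in a relational table, an integration tool has to flatten
# the nesting and explode the array into one row per click.
rows = [
    {
        "user_id": event["user"]["id"],
        "country": event["user"]["geo"]["country"],
        "session": event["session"],
        "page": click["page"],
        "ts": click["ts"],
    }
    for click in event["clicks"]
]

for row in rows:
    print(row)
```

Every added level of nesting or optional field multiplies this kind of flattening logic, which is exactly the extra work the older tools must do before they can store such data.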
The reverse is not true. Newer data integration environments can handle regular, tabular data as easily as JSON, and they are less brittle than the older
data integration tools. One simple field change upstream can break a dataflow in the older tools, where the
more flexible new environment may be able to continue uninterrupted.
JSON is not the best format for storing data, however. This means tools are needed to translate data from
JSON to more efficient storage formats in Hadoop, and from those formats back to JSON for applications.
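A minimal sketch of that translation, assuming Spark is available on the lake; the HDFS paths are illustrative.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("json-to-parquet").getOrCreate()

# Land raw JSON events as they arrived, then rewrite them in a columnar
# format (Parquet here) that is far cheaper to store and scan.
events = spark.read.json("hdfs:///lake/raw/events/")
events.write.mode("overwrite").parquet("hdfs:///lake/standardized/events/")

# The reverse direction: read the efficient format and hand JSON back
# to an application that expects it.
standardized = spark.read.parquet("hdfs:///lake/standardized/events/")
standardized.write.mode("overwrite").json("hdfs:///exports/events_json/")
```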
Much of the web and non-transactional data is sent today as JSON messages. The more flexible Hadoop
and streaming technologies are a better match for transporting and processing this data than conventional
data integration tools.
Many of these forms can be represented as tables in a database, but they are not accessed or processed
the same way a relational database processes queries against tables. These forms are one of the reasons we
have new non-relational engines, for example graph processing engines, time series stores and general-purpose execution engines such as Spark.
Datasets composed of discrete objects, like image collections, scanned documents or free text are a poor
match for the database environment. A data lake must support the storage and processing of these as easily
as it supports the more structured forms of graphs and tables.
These requirements compel rethinking of basic assumptions about data architecture and system design, assumptions that have been present for two decades. It is no longer sufficient to use only a relational database and an ETL tool, nor is there a single unified data model for all data. Instead there are many data models, storage formats and processing engines, each suited to different data and workloads.
The data lake architecture (its functional components) is shown in figure 1. The most important thing to note about diagrams like this is that
the hard work is hidden in the arrows, not the boxes.
Similar to a data warehouse, the data lake has varied sources of data. Unlike in a data warehouse, that data
need not be cleaned during acquisition, and it can arrive at any frequency, from real time to hours or
days. The lake can persist the data regardless of its native format and arrival rate.
The data lake concept is not just a place to put data. It's a place to work on the data too. Processing in a
lake is unlike that done for a data warehouse. The lake platform handles both storage of data and its processing,
whereas a data warehouse separates these into extract, transform and load (ETL) tools and a database.
Because of this separation into layers, all the data must be conformed to tables and integrated before it can
be loaded.
A lake can retain data in its native format but can also standardize, clean, aggregate and then store data in
more consumable and conformed formats. This is a change from a data warehouse architecture. The same
information may be managed in raw, standardized and purpose-built formats on the same platform, at the
same time. Data warehouses have just one place for user-accessible data.
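As a sketch of how the same information can be held in raw, standardized and purpose-built form on one platform, assuming Spark on the lake; the paths and field names are invented for the example.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("lake-zones").getOrCreate()

# Raw zone: the data exactly as it arrived.
raw = spark.read.json("hdfs:///lake/raw/orders/")

# Standardized zone: the same data, deduplicated, typed and cleaned,
# stored alongside the raw copy rather than replacing it.
standardized = (
    raw.dropDuplicates(["order_id"])
       .withColumn("order_ts", F.to_timestamp("order_ts"))
       .filter(F.col("amount").isNotNull())
)
standardized.write.mode("overwrite").parquet("hdfs:///lake/standardized/orders/")

# Purpose-built zone: an aggregate shaped for one consuming application.
daily_revenue = (
    standardized.groupBy(F.to_date("order_ts").alias("day"))
                .agg(F.sum("amount").alias("revenue"))
)
daily_revenue.write.mode("overwrite").parquet("hdfs:///lake/consumption/daily_revenue/")
```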
Another difference is access to data. A
data warehouse is the final resting place
for data. A data lake may be the
permanent location where data is used,
or it may support a data refinement
process whose endpoint is another
application. That application might be
built on the lake platform or it might be a
remote application that uses the data
independently. This adds data
management and integration elements to a
data lake that did not exist in the traditional
environment.
Figure 1: Data lake components. The acquisition component allows any data to
be collected at any latency. The management component allows some data to be
standardized and integrated. The access component provides access at any
latency and via any means an application or user needs. Processing can be done
to any data at any time from any area.
A data warehouse and legacy data integration tools can meet some of these needs, but not in a consistent
and unified fashion, because these needs were not the primary focus of their design. A data lake dictates
new requirements that the existing tools and environments can't easily support.
The remainder of this paper will focus on the core capability of a data lake that differentiates it from earlier
platforms: general-purpose data processing.
Data Acquisition
Acquiring data involves more than pointing systems that produce data at Hadoop. The history of Hadoop
has mainly been about data in streams or logs that are collected and stored, then processed. External
connectivity is largely missing, except within the new ecosystem of streams, logs and APIs. The general
assumption is that all data comes to the Hadoop environment where it resides. This simplifying assumption
works well when building a single Hadoop application, or when the organization is primarily an online digital
business.
The challenge comes when data is not in the new ecosystem but in existing enterprise systems, whether
they be SAP applications, custom Oracle systems or software-as-a-service (SaaS) applications like Workday
and Salesforce. For example, behavioral events and web clicks are useful, but core financial transactions and
product reference data are equally important if one wants to link behaviors to outcomes.
Enterprise data exists outside the big data ecosystem and must be fetched via different connectors and
mechanisms. The problem becomes one of data extraction, an area where the Hadoop ecosystem lacks
connectivity and tool support.
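The gap is usually closed by reaching out to those systems directly. A sketch of one common approach, pulling a reference table over JDBC with Spark, with placeholder connection details and assuming the matching JDBC driver is on the cluster's classpath.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("enterprise-extract").getOrCreate()

# Pull a product reference table from an existing relational system and
# land it in the lake next to the event data. URL, credentials, table
# name and target path are placeholders.
products = (
    spark.read.format("jdbc")
         .option("url", "jdbc:oracle:thin:@//erp-host:1521/ORCL")
         .option("dbtable", "PRODUCT_MASTER")
         .option("user", "etl_user")
         .option("password", "change_me")
         .option("fetchsize", "10000")
         .load()
)
products.write.mode("overwrite").parquet("hdfs:///lake/raw/product_master/")
```

Doing this for dozens of SAP, Oracle or SaaS sources means building and maintaining a connector for each one, which is exactly the kind of extraction work the older integration tools were built around.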
The data warehouse and data integration markets assume that data must be pulled from other systems. The
primary purpose of data processing in these environments is to extract data, integrate and clean it, then put
it into a database. These tools offer only a partial solution to the requirements of a data lake: they operate on the assumption that the destination is a relational database.
Building a single dataflow that extracts data from source systems and links it with previously collected data is a challenge. Orchestrating many such dataflows, each
with different needs, dependencies and SLAs, is more challenging still. Writing code to process data is only
one piece of a larger problem, and all these extra pieces make manual coding hard to sustain.
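To show what sits around the processing code, here is a sketch of one such dataflow expressed as a dependency graph in a workflow scheduler, Apache Airflow in this case; the tasks, scripts and schedule are purely illustrative.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# A hypothetical nightly dataflow: extract from two source systems,
# link the results with previously collected data, then publish.
with DAG(
    dag_id="nightly_order_pipeline",
    start_date=datetime(2015, 6, 1),
    schedule_interval="0 2 * * *",   # run at 02:00 to meet a morning SLA
    catchup=False,
) as dag:
    extract_erp = BashOperator(task_id="extract_erp",
                               bash_command="spark-submit extract_erp.py")
    extract_crm = BashOperator(task_id="extract_crm",
                               bash_command="spark-submit extract_crm.py")
    link_history = BashOperator(task_id="link_with_history",
                                bash_command="spark-submit link_history.py")
    publish = BashOperator(task_id="publish_to_warehouse",
                           bash_command="spark-submit publish.py")

    # Both extracts must finish before linking; publishing comes last.
    [extract_erp, extract_crm] >> link_history >> publish
```

Even this small example carries scheduling, dependency and failure-handling concerns that have nothing to do with the transformation logic itself.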
The best way to see the challenge faced when building a data lake is to focus on integration in the Hadoop
environment. A common starting point is the idea of moving ETL and data processing from traditional tools
to Hadoop, then pushing the data from Hadoop to a data warehouse or database like Amazon Redshift so
that users can still work with data in a familiar way.
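A simplified sketch of that offload pattern: parse raw web logs with Spark and land the results in a columnar format that a warehouse loader or an on-cluster query engine can consume. The paths are illustrative and the code assumes Apache-style access log lines.

```python
import re

from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.appName("weblog-offload").getOrCreate()

# One line of an Apache-style access log: ip, timestamp, request, status, size.
LOG_PATTERN = re.compile(
    r'(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+) \S+" (\d{3}) (\S+)'
)

def parse_line(line):
    m = LOG_PATTERN.match(line)
    if m is None:
        return None          # malformed lines are dropped below
    ip, ts, method, url, status, size = m.groups()
    return Row(ip=ip, ts=ts, method=method, url=url,
               status=int(status), bytes=0 if size == "-" else int(size))

lines = spark.sparkContext.textFile("hdfs:///lake/raw/weblogs/")
events = spark.createDataFrame(lines.map(parse_line).filter(lambda r: r is not None))

# Store the parsed events where a warehouse load job or query engine
# can pick them up.
events.write.mode("overwrite").parquet("hdfs:///lake/standardized/weblogs/")
```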
If we look at some of the specifics in this scenario, the problem of using a bill of materials as a technology
guide becomes apparent. For example, the processing of web event logs is unwieldy and expensive in a
database, so many companies shift this workload to Hadoop. Table 1 lists the components that could be
used to offload this work into Hadoop.
As with any set of components, there are tradeoffs in the choices. By choosing a specific language, the
choice of engine may be limited. For example, at the time of this writing Pig works with Tez and MapReduce
but not Spark. Choosing a specific engine may constrain the language, forcing developers to use an
unfamiliar or lower-level language. The tradeoffs extend to data storage: a file format that is optimal for storing the desired information may not be optimal for processing performance.
Table 1: Log processing requirements and the components needed to meet them, among them collecting web log data and storing it for processing, loading a data warehouse, monitoring individual jobs and the collection of jobs, and managing metadata (for example, with Cloudera Navigator).
Conclusion
When you set out to build a data lake you are really creating a data platform on which other applications
depend. A data lake should last for years without significant redesign that impacts the applications it
supports. This is one of the capabilities the data warehouse got right.
To build this requires a new architecture and tools that enable end-to-end data processing. The data lake
requires tools similar to those the warehouse environment provided: tools to hide the underlying platform complexity,
orchestrate the components, and allow the developer to focus on the tasks that are important to building
applications.
Figure 2: SnapLogic accelerates the development of an enterprise data lake by automating data acquisition, transformation and access.
Data access: organizing and preparing data for delivery and visualization. Make data processed on Hadoop
or Spark easily available to off-cluster applications and data stores such as statistical packages and
business intelligence tools.
SnapLogic's platform-agnostic approach decouples data processing specification from execution. As data
volume or latency requirements change, the same pipeline can be used just by changing the target data
platform. SnapLogic's SnapReduce enables SnapLogic to run natively on Hadoop as a YARN-managed
resource that elastically scales out to power big data analytics, while the Spark Snap helps users create
Spark-based data pipelines ideally suited for memory-intensive, iterative processes. Whether MapReduce,
Spark or another big data processing framework, SnapLogic allows customers to adapt to evolving data lake
requirements without locking into a specific framework.
Key Takeaways
One way to escape the spiral of increasing complexity and maintenance costs with Hadoop is to develop not an
application, but a data lake that integrates repeatable, reusable components and isolates the highly changeable
work done by applications. This is a shift from the idea of Hadoop as a database to the idea of Hadoop as the
engine inside a data lake.
Much of the web and non-transactional data is sent today as JSON messages. The more flexible infrastructure of
a data lake can support both streaming and batch. This makes it a better match for transporting and processing
data than databases and conventional data integration tools.
Hadoop is to data lake as database is to data warehouse. Hadoop is one piece of the bigger system that is a
data lake. It requires other components that are not - at least today - part of the base platform.
Supporting Hadoop or a data lake in an enterprise setting is a challenge today because of the many components
that must be integrated. Higher-order tools are required to escape this complexity.
The technology market is shifting away from manual integration and custom coding for each Hadoop application
to higher-level tools, because the goal isn't to write support code; it is to build valuable
applications.
About SnapLogic
SnapLogic is the industry's first unified data and application integration platform as a service (iPaaS). The
SnapLogic Elastic Integration Platform enables enterprises to connect to any source, at any speed, anywhere,
whether on premises, in the cloud or in hybrid environments. The easy-to-use platform empowers self-service integrators, eliminates information silos, and provides a smooth onramp to big data. Founded by data
industry veteran Gaurav Dhillon and backed by leading venture investors, including Andreessen Horowitz and
Ignition Partners, SnapLogic is helping companies across the Global 2000 to connect faster. Learn more
about SnapLogic for big data integration at www.SnapLogic.com/bigdata.