
CHECKLIST REPORT
2018

Data Management for Data Lakes in the Cloud

By Philip Russom

Sponsored by: Informatica

FEBRUARY 2018

TABLE OF CONTENTS

FOREWORD

NUMBER ONE
Catalog Data to Keep the Lake from Becoming a Swamp

NUMBER TWO
Address the Data Lake's Aggressive Data Ingestion Methods with DM Practice Adjustments and Data Integration Tools

NUMBER THREE
Design a Cloud-Based or Hybrid Architecture for Your Data Lake

NUMBER FOUR
Consider iPaaS as Tooling for Cloud DM

NUMBER FIVE
Make Your Cloud-Based Data Lake a Nexus for Sharing Modern and Traditional Data, with a Focus on External Sources

NUMBER SIX
Use the Data Lake as a Self-Service Data Exploration Platform

ABOUT OUR SPONSOR

ABOUT THE AUTHOR

ABOUT TDWI RESEARCH

ABOUT TDWI CHECKLIST REPORTS

555 S. Renton Village Place, Ste. 700
Renton, WA 98057-3295
T 425.277.9126
F 425.687.2842
E info@tdwi.org
tdwi.org

© 2018 by TDWI, a division of 1105 Media, Inc. All rights reserved. Reproductions in whole or in part are prohibited except by written permission. Email requests or feedback to info@tdwi.org.

Product and company names mentioned herein may be trademarks and/or registered trademarks of their respective companies. Inclusion of a vendor, product, or service in TDWI research does not constitute an endorsement by TDWI or its management. Sponsorship of a publication should not be construed as an endorsement of the sponsor organization or validation of its claims.
FOREWORD

This report discusses the leading data management (DM) best practices you need for data lakes to be successful when deployed in the cloud. It will drill into those best practices—which are a mix of business and technology matters—and will look at numerous other success factors for lakes and clouds, with a focus on DM requirements for platforms, architectures, analytics, and integration.

To set the scene, let's start with an inspirational mission statement from a user who has designed and managed a cloud-based data lake successfully. Then let's discuss the mission.

"Our opportunity is to disrupt our existing business processes by leapfrogging standard data architectures and adopting new and leading data architectures to explore business innovation and support our digital journey."

Business and Technology Drivers for Cloud-Based Data Lakes

Data lake disruption. Many organizations are facing a flood of new data types and sources coming from big data, customer channels, social media, the Internet of Things (IoT), and numerous external sources (such as partners and third-party data providers). They know they need to disrupt "business as usual" because older DM best practices—and the ways a business gets value from data—don't necessarily manage new data assets appropriately or generate business value. Organizations need disruptive database designs (such as data lakes) and modern computing platforms (such as clouds) that are optimized for the aggressive ingestion and agile use of new data assets.

Leapfrogging standard data architectures. A lake or cloud can breathe new life into established enterprise data architectures (data warehouses, marketing channel data, digital supply chains) or create new and different ones (analytics labs and sandboxes, ecosystems of cloud-based operational applications). This way, traditional approaches continue to deliver value while modern approaches pioneer new practices and value and enable the future.

Data-driven business innovation. Nowadays, digital enterprises serve and manage customers through data; imagine and design new products through data; discover and deploy new business processes through data; and compete and lead through data. A cloud-based data lake can integrate old and new data at massive scale in new and creative ways to enable the data discovery and analytics that lead to data-driven business innovation.

Analytics is the data lake's overarching use case. Sometimes analytics is the sole use case, as when a lake is the core of an analytics program or an extension of a data warehouse. Other times, analytics is a significant component within an operational solution, as with marketing or supply chain data lakes. Some data lakes are built purely for self-service data exploration and discovery, which often lead to visualization or some other form of analytics. Hence, with a data lake, the path to business value usually leads through analytics.

Data Management Challenges for Cloud-Based Data Lakes

The cloud-based data lake has clear and compelling benefits. However, it also faces many challenges in the realm of DM:

• A data lake needs solid semantics—metadata, glossaries, and cataloging—to keep from becoming a swamp, despite the fact that many big, new, external, and cloud data sources have no metadata or equivalent semantics.

• A data lake must ingest highly diverse data, at scale, at multiple latencies, for immediate use, while interoperating with a long list of systems, in the cloud and on premises.

• Data lakes tend to have complex multiplatform data architectures (MDAs) and they exchange data with other MDAs. DM must stitch together these complex hybrid environments.

• DM tools used in hybrid environments should support traditional best practices (integration, quality, master data management) as well as new ones (microservices, orchestration) via an integrated tool platform. One new tool-based solution for this problem is iPaaS.

• For the richest analytic correlations, a data lake should integrate traditional and modern data—both structured and unstructured—from many sources at multiple latencies. The explosion of data from external sources is both a problem and an opportunity.

• Data lakes mostly manage extracted or streamed source data in its original raw form so it can be repurposed repeatedly for multiple use cases. Yet you must also present this data in forms that are conducive to data exploration and other self-service practices for a wide range of user types.

This report will now discuss these challenges and offer practical solutions.

NUMBER ONE
CATALOG DATA TO KEEP THE LAKE FROM BECOMING A SWAMP

An all-too-common misconception about data lakes is that they are dumping grounds for any random data that anyone wants to put there. The problem with data dumping is that data is not appropriately named and documented via semantics as it comes into the data lake. Semantics includes various forms of metadata, cataloging, glossaries, tags, and file headers. The problem is exacerbated in cloud environments, which regularly deal with data from the Web, streams, and other external sources that notoriously lack metadata or any other form of semantics. Additionally, developers cannot attach to external sources to extract metadata or its equivalent.

Experienced data professionals know that data dumping is a "worst practice" that can turn a data lake into the so-called "data swamp." Due to the lack of proper semantics, a data swamp is difficult to use, trust, govern, browse, query, or analyze. Hence, getting full business use and value from a cloud-based data lake requires significant semantics, as follows:

Next-generation metadata management. Because most tool types and user query practices require metadata, tools should be in place to develop, capture, and automatically deduce technical metadata. These scenarios involve traditional development as well as new situations.

For example, as users explore a lake's data (a common business requirement for a lake), the tool should help them develop metadata as they go. For metadata-free source data, a tool should parse the incoming file, data set, or message to deduce its schema and turn that into reusable technical metadata. This valuable automation may be enabled by a rules engine or by an artificial intelligence and machine learning (AI/ML) algorithm, and it may be curated by a user or applied autonomously without human intervention.
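To make the deduction step concrete, here is a minimal sketch of rule-based schema inference for a metadata-free JSON source. It is illustrative only (the function names, the ISO-date rule, and the file path are assumptions, not any product's behavior), but it shows how parsed samples can become reusable technical metadata pending curation by a user:

```python
import json
from datetime import datetime

def infer_type(value):
    """Map a raw JSON value to a coarse technical type."""
    if isinstance(value, bool):          # check bool before int: bool is an int subtype
        return "BOOLEAN"
    if isinstance(value, int):
        return "INTEGER"
    if isinstance(value, float):
        return "DOUBLE"
    if isinstance(value, str):
        try:                             # simple rule: ISO-formatted strings are timestamps
            datetime.fromisoformat(value)
            return "TIMESTAMP"
        except ValueError:
            return "STRING"
    return "STRING"

def infer_schema(records):
    """Deduce a field-to-type map from a sample of records; conflicting
    types across records widen to STRING."""
    schema = {}
    for record in records:
        for field, value in record.items():
            t = infer_type(value)
            if schema.get(field, t) != t:
                t = "STRING"
            schema[field] = t
    return schema

with open("landing/events.jsonl") as f:          # hypothetical landing file
    sample = [json.loads(line) for line in f]
print(infer_schema(sample))                      # reusable technical metadata
```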
Business metadata and business glossaries. The number of business users and other nontechnical users continues to increase. They demand easy-to-use tools and data semantics that will enable them to access and use a lake's data autonomously. At a minimum, they need business metadata, which translates technical metadata into business-friendly descriptions of data that they can understand. On a more sophisticated level, this class of user is progressively demanding a business glossary with which they can work collaboratively with others to create, define, and apply terms describing common business entities, such as customer, profitability, and production yield.

Data cataloging. In a lot of ways, data cataloging goes beyond semantics—a straightforward description of data—to intelligence, which is rich information about data and relations among data elements. The intelligence can be interpreted to enrich browsing, searches, queries, reports, and analytics, which in turn greatly raises data's value to users and the business. Modern intelligent practices in data cataloging have much to offer data lake and cloud data users:

1. Holistic view. A mature catalog can literally visualize an entire data lake, or subsets of it, to illustrate the data available and relations among data elements. Ideally, a data catalog integrates with metadata repositories and business glossaries for consistent terminology and definitions across the holistic view.

2. Single entity, cataloged multiple ways. Users may categorize data by data domain, source, lineage, personally identifiable information, compliance sensitivity, and so on. This enables them to scan data by numerous criteria at the semantic level while reducing source data processing (see the sketch after this list).

3. Multiple data access methods via the catalog. Every user and use case is different. To satisfy diverse needs, a catalog should provide access via browsing, searching, and querying.

4. Crowd-source the development of data intelligence. Users don't just access a data catalog; they help to develop it. Users can enter data elements into the catalog; review and rank elements by quality, trust, and usability; and curate entries in the catalog for accuracy or proper placement in a taxonomy or hierarchy.¹
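A small data structure makes "single entity, cataloged multiple ways" tangible. The fields and sample entries below are invented for illustration; the point is that one catalog entry carries several independent categorizations, so users can scan by domain, PII status, or lineage without touching source data:

```python
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    """One data set, categorized along several independent dimensions."""
    name: str
    source: str
    domain: str
    contains_pii: bool = False
    sensitivity: str = "internal"                 # e.g., public / internal / restricted
    lineage: list = field(default_factory=list)   # upstream data sets
    ratings: list = field(default_factory=list)   # crowd-sourced quality scores

catalog = [
    CatalogEntry("web_clickstream", source="cdn_logs", domain="marketing"),
    CatalogEntry("customer_master", source="crm", domain="customer",
                 contains_pii=True, sensitivity="restricted",
                 lineage=["crm_extract_raw"]),
]

# Scan by any criterion at the semantic level, without source data processing.
print([e.name for e in catalog if e.contains_pii])          # PII review
print([e.name for e in catalog if e.domain == "marketing"]) # domain browse
```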

General semantics requirements. Whether metadata, glossary, or catalog, semantics should be centralized for a single view that is accessible by broadly deployed users and technologies. Recognize that data lake users will demand self-service data practices (data exploration, preparation, visualization) and these are not possible without friendly and broad semantics. For all forms of semantics (and many other DM tasks), you should demand tools that support productive automation through a rules or an AI/ML engine.

¹ For a detailed discussion of modern data cataloging, read the TDWI Checklist Report The Data Catalog's Role in the Digital Enterprise, online at tdwi.org/checklists.
NUMBER TWO
ADDRESS THE DATA LAKE'S AGGRESSIVE DATA INGESTION METHODS WITH DM PRACTICE ADJUSTMENTS AND DATA INTEGRATION TOOLS

Lake ingestion methods differ from other database-driven practices.

A hallmark of the data lake is its unique approach to data ingestion. On the one hand, a data lake of any maturity will continuously ingest large volumes of data from hundreds of sources. On the other hand, a lake stores most data in its raw source form so that the original details are preserved for future analytics, reporting, and operations that may need them.

To satisfy these two DM requirements, the data integration and other processes feeding the lake store incoming data as quickly as possible, with little or no alteration or improvement. This allows the ingestion process to perform and scale by delaying data processing workloads until read time. If a lake's data needs processing to be repurposed for various use cases, that processing occurs later, typically well after ingestion and increasingly on the fly as data is explored or a report is refreshed.

The data lake's "process on read" strategy differs from the data warehouse's "process before ingestion" strategy, which is fine because the two serve complementary use cases with different requirements. A data lake is a raw detail store supporting mostly discovery-oriented analytics, and it by nature accepts whatever data models, quality states, and semantics it gets. A data warehouse is a store of mostly calculated and carefully structured values for recurring reports, which have rather stringent requirements and standards for modeling, quality, accuracy, and audit. Hence, data warehouse professionals have the skills they need for data lakes but will need to adjust them slightly to align with the unique requirements of the data lake.
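The contrast between the two strategies is easy to see in code. The sketch below is illustrative (the paths and function names are invented): ingestion copies raw bytes untouched, and structure is applied only when the data is read.

```python
import csv
import shutil

def ingest(raw_file, landing_dir="lake/landing"):
    """Ingest as-is: copy bytes to the landing zone with no parsing or
    cleansing, so ingestion stays fast and original detail is preserved."""
    return shutil.copy(raw_file, landing_dir)

def read_with_schema(landed_file, columns):
    """Apply structure at read time (process on read): parse, project,
    and lightly standardize only when a use case asks for the data."""
    with open(landed_file, newline="") as f:
        for row in csv.DictReader(f):
            yield {c: (row.get(c) or "").strip() for c in columns}

# A warehouse-style pipeline would validate and model *before* loading;
# the lake defers that work until exploration or reporting needs it.
```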
Most challenges of the lake's data ingestion methods are met by data integration tools.

Speed and scale. Most data lakes are ingesting data around the clock, often at Internet volumes. Achieving speed and scale depends on a data integration platform that can parallelize all jobs for maximum throughput.

Containers from files to messages. A cloud-based data lake will ingest data from traditional on-premises sources and modern Web-based ones, which amounts to many container types, including XML and JSON files, SQL result sets, bulk and block data sets, and streaming events and messages. The data integration tool must ingest all these quickly and at scale.

Unpredictable ingestion and processing loads. One reason data lakes are being deployed more on clouds is to take advantage of cloud's automatic allocation of resources as workloads in the lake ramp up and subside. Cloud-based integration processing can do the same.

Rich library of APIs and open interfaces. A data lake of any maturity will ingest data from a long list of tools and platforms, which means its data integration has to support many demanding interfaces. Besides the usual enterprise applications, this also involves SaaS apps such as Marketo, Salesforce, and NetSuite, as well as cloud-based PaaS providers such as Amazon, Google, and Microsoft. Given the mass and complexity of interfaces, a modern data integration tool should support API management.

Automation for the rapid addition of data sources. Data lakes and cloud DM regularly take on data from new sources. For example, in industries involved in IoT (such as logistics, utilities, and supply chains), new machines, devices, vehicles, and sensors can come online every week, generating valuable data for both operational and analytics use. To get to such business value quickly, the data integration tool must have easy, agile, and repeatable functionality for source onboarding. Ideally, onboarding should be automated via rules or machine learning.
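As a sketch of what rule-driven onboarding can look like (the rulebook format and source names here are hypothetical, not a specific tool's), each rule recognizes a source pattern and supplies its ingestion settings, so a matching new source needs configuration rather than new code:

```python
# Each rule maps a recognizable source pattern to ingestion settings.
ONBOARDING_RULES = [
    {"match": lambda name: name.endswith(".jsonl"),
     "format": "json-lines", "latency": "streaming"},
    {"match": lambda name: name.startswith("edi_"),
     "format": "edi-x12", "latency": "batch"},
]

def onboard(source_name):
    """Pick ingestion settings for a new source from the rulebook;
    unmatched sources are parked for human curation."""
    for rule in ONBOARDING_RULES:
        if rule["match"](source_name):
            return {"source": source_name,
                    "format": rule["format"],
                    "latency": rule["latency"]}
    return {"source": source_name, "status": "needs-curation"}

print(onboard("truck_042.jsonl"))   # {'source': ..., 'format': 'json-lines', ...}
```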
Judicious data quality. As discussed earlier, data lakes tend to store data in its original form, without improvement until later. Even so, over time, data professionals managing the lake will learn (as they would with any data set) where data standardization is needed to facilitate self-service, data exploration, reporting, or query performance. Such improvements should be applied to documented copies of data so details of the original source are not lost. When lake data is cataloged properly, information captured in the catalog can be the equivalent of standardization, but performed at the semantic level without altering actual data.

Data landing and staging on steroids. In many ways, a data lake is like a massive data landing area, with extreme levels of ingestion but low levels of staging. For this reason, many data lakes serve the primary purpose of discovery analytics while also serving as data landing and staging for other systems, such as a data warehouse or channel marketing data ecosystem.

NUMBER THREE
DESIGN A CLOUD-BASED OR HYBRID ARCHITECTURE FOR YOUR DATA LAKE

One of the strongest—and most challenging—trends in data management today is toward environments consisting of numerous data platforms where data is physically distributed across multiple database servers, file systems, and storage. This is certainly the case with most data lakes, especially those involving clouds. When diverse data platforms and data sets are integrated this way, the result is called a multiplatform data architecture (MDA). Synonyms include distributed data architecture and hybrid data ecosystem.²

The trend toward MDAs affects data lakes in multiple ways.

A data lake is a data architecture. Note that a data lake is a method for managing data, not a platform per se. The method assumes you are willing and able to create a large-scale architecture (sometimes called a design pattern) for your data lake, similar to how you've modeled and architected databases, data warehouses, data hubs, and data stores. A lake's architecture may involve one or several data platforms and—like most architectures—it has multiple components and layers. Furthermore, the lake's architecture may overlap with other MDAs, as when a lake extends a data warehouse environment.

Most data lakes are hybrid MDAs. Research by TDWI has revealed that more than half of data lakes are hybrids, spanning a mix of Hadoop clusters, file systems (e.g., Amazon S3 and Azure DLS), and diverse brands of relational database management systems (RDBMSs). Many lakes involve data platforms located both on premises and in the cloud. Hence, you should probably assume a hybrid MDA when designing your data lake.³

The cloud is an emerging data lake platform. As cloud usage for DM has matured, user organizations have started leveraging a variety of data systems that most cloud providers support. For example, at TDWI we see users turning to Hadoop, Amazon S3, and Azure DLS for cloud-based data lakes, as well as Redshift, Snowflake, and SQL Server on Azure for cloud-based data warehousing. Hence, most MDAs already include cloud-based systems and they will no doubt involve more over time, especially for data lakes, warehouses, and analytics.

The point of the MDA is to provide options for data's diverse structures and uses.

For example, as businesses deploy more sensors, customer channels, applications, and social media, the breadth of source data's schema, latencies, and containers is driven up, which in turn requires more types of data platforms to capture and process the diverse data appropriately. Similarly, relational and other structured data types are being joined by a widening range of multistructured and unstructured data types.

As another example, businesses are diversifying the use of analytics, reports, and data-driven business monitoring, all of which have unique requirements for data capture, storage, and processing. These in turn drive up the diversity of data platforms and related tools. TDWI sees users succeeding with MDAs in data warehousing, analytics, multichannel marketing, digital supply chain, IoT, and other data-driven enterprise programs.

MDA complexity is the problem.

Extreme complexity results from the number of systems involved, multiplied by the extreme diversity of platform types that may integrate multiple brands of database management systems (both old and new), NoSQL platforms (especially Hadoop), and tools for data integration, analytics, and stream processing. These may be deployed on premises, in the cloud, or in hybrid combinations.

The complexity of MDAs is amplified by the growing popularity of major cloud platforms in data-driven solutions, namely Amazon Web Services (AWS), Google Cloud, and Microsoft Azure. Even more dramatic is the surging portfolio of cloud-based SaaS applications users are subscribing to, including Salesforce, Marketo, NetSuite, Paylocity, and Workday. We say "the cloud" as if there is only one, but many organizations are managing data across many cloud-based data platforms and SaaS apps.

There are solutions to the MDA complexity problem.

DM tools. Although data is strewn about physically, there are usually cross-platform architectural layers that stitch together the MDA and its data. For example, architectural layers for DM can unify an MDA, especially the layers for metadata and other semantics. Virtual DM techniques such as federation and data virtualization can create views that make data look simpler and more unified than it actually is. Data travels—a lot—among the platforms of an MDA, such that a well-constructed network of data integration data flows constitutes yet another unifying architectural layer.

Another emerging best practice for orchestrating data integration data flows is the integration hub, which employs a publish-and-subscribe method of sourcing and distributing data. An integration hub makes it easier to introduce and maintain data and avoids the integration morass that too many data architectures deteriorate into. Note that all these solutions to the MDA complexity problem require a well-integrated suite of diverse DM tools.
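A minimal sketch shows why publish-and-subscribe tames point-to-point sprawl. The class below is an illustration, not a product API: a producer publishes a record to a topic once, and any number of downstream platforms consume it.

```python
from collections import defaultdict

class IntegrationHub:
    """Toy publish-and-subscribe hub: one publish reaches every
    subscriber, so data flows don't multiply point to point."""
    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self.subscribers[topic].append(handler)

    def publish(self, topic, record):
        for handler in self.subscribers[topic]:
            handler(record)

hub = IntegrationHub()
hub.subscribe("orders", lambda r: print("lake landing:", r))
hub.subscribe("orders", lambda r: print("warehouse staging:", r))
hub.publish("orders", {"id": 17, "amount": 99.0})   # one publish, two consumers
```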
² For a detailed definition of MDAs, replay the archived TDWI Webinar Defining the Multiplatform Data Architecture and What It Means to You, online at tdwi.org/webinars.

³ For a detailed rundown of lake platforms, see the TDWI Best Practices Report Data Lakes: Purposes, Practices, Patterns, and Platforms, online at tdwi.org/bpreports.

Architecture and governance teams. Tools aside, data architects are indispensable to the success of an MDA, including those responsible for data lakes and clouds. An enterprise data architect (or data warehouse architect, etc.) typically heads a team of people who influence the selection of data platforms and DM tools relative to MDAs. The team also fosters enterprise data standards that facilitate the integration of data across platforms, whether on premises or in the cloud. A data governance or stewardship program may provide similar assistance with data standards and architectural preferences.

NUMBER FOUR
CONSIDER IPAAS AS TOOLING FOR CLOUD DM

Integration Platform-as-a-Service (iPaaS) is a response to new business and technical requirements.

Many user organizations have compelling reasons or executive mandates to move to the cloud. They seek to follow cloud-first policies, control IT costs, integrate disparate applications, deliver data-driven solutions faster, and provide integration infrastructure for complex multiplatform data environments that include clouds. Organizations facing any combination of these evolving business and technology requirements are realizing that their traditional on-premises integration solutions are not a good fit for fast-paced cloud operations or complex hybrid environments. Furthermore, users migrating data and applications to the cloud need richer cloud-based integration toolsets both to facilitate the migration and to support daily cloud-native integration flows. As a result, Integration Platform-as-a-Service (iPaaS) has emerged to address today's requirements. IPaaS is a suite of cloud microservices enabling the development, execution, and governance of integration flows and data pipelines. It also enables modern, cloud-based DM, which can ably apply to cloud-based data lakes.

DM requirements for iPaaS

In a nutshell, here are the essential requirements for iPaaS, plus some of its uses and benefits:

Powered by the cloud. IPaaS is cloud based to take advantage of cloud elasticity, scalability, flexibility, and low cost. This also allows it to interface directly with Internet-based applications (Salesforce, Marketo, NetSuite) and data sources (Web apps, B2B partners, IoT).

Fast track to integration solutions. IPaaS can accelerate application and data integration. For example, when a public cloud has multiple tools already set up and optimized, it prevents users from burning up valuable time and personnel on system integration. This way, cloud-based systems present minimal time until business use. In many cases, users can set up interfaces, load data, migrate users, and put the solution into production in a few days.

All forms of integration. Vendor products for iPaaS vary, but the comprehensive ones support the many functions of both data integration and application integration. This is because, in addition to data integration, many users need a data-driven toolset for reliably migrating, consolidating, and delivering application data, plus managing data from SaaS apps.

Multitool suite. IPaaS suites tend to include many integration tool types through a single development and management console, including those for data integration, quality, and master data management (MDM); plus application integration, orchestration, and process management. The unified toolset boosts developer productivity, the creation of consistent standards, and broad curation. Furthermore, it fosters the modern design of flows and pipelines that incorporate multiple forms of integration technology.

Microservices. IPaaS functionality is available via cloud-based microservices. Almost any data- or application-integration function you can think of is now a service (see the sketch following this list).

API driven. To fulfill its aggressive integration goals, an iPaaS suite must support all modern and traditional application programming interfaces (APIs), in addition to including special functionality for managing API portfolios and performance.

Built for hybrid environments. Despite the focus on the cloud, iPaaS also provides integration microservices that can be used by on-premises applications and tools. In fact, organizations with iPaaS typically use it as a nexus that provides rich integration and interoperability for the many platforms of a hybrid data and application environment. Given that many data lakes are cloud based and hybrid, iPaaS can be an appropriate DM solution for them.⁴
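Because iPaaS functions are exposed as cloud microservices behind APIs, invoking one typically amounts to an authenticated HTTP call. The sketch below is entirely hypothetical: the endpoint, payload fields, and connector names are invented for illustration and belong to no specific vendor.

```python
import json
import urllib.request

# Hypothetical endpoint: iPaaS vendors expose integration functions as
# HTTP microservices; this URL and payload are invented for illustration.
ENDPOINT = "https://ipaas.example.com/v1/flows/lake-ingest/runs"

payload = json.dumps({
    "source": "salesforce:accounts",     # connector name (illustrative)
    "target": "lake/landing/accounts",   # landing path in the data lake
    "mode": "incremental",
}).encode()

req = urllib.request.Request(
    ENDPOINT, data=payload,
    headers={"Content-Type": "application/json",
             "Authorization": "Bearer <token>"})
with urllib.request.urlopen(req) as resp:    # returns a run handle to poll
    print(json.load(resp))
```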
⁴ For more information, read the TDWI Checklist Report Data Management Best Practices for Cloud and Hybrid Architectures, online at tdwi.org/checklists. Also, replay the archived TDWI webinar Achieving Integration Agility, Scale, and Simplicity via Cloud-Based Integration Platform-as-a-Service, available online at tdwi.org/webinars.

NUMBER FIVE
MAKE YOUR CLOUD-BASED DATA LAKE A NEXUS FOR SHARING MODERN AND TRADITIONAL DATA, WITH A FOCUS ON EXTERNAL SOURCES

One of the fastest growing trends in information technology today concerns the burgeoning number of parties and applications that are external to an enterprise. Most of these generate data and may demand data in return, and that data is critical to business success. Therefore, data management professionals are under pressure to capture and manage data coming from sources and technologies that are new to them. They must also support new business use cases by processing data in ways that are likewise new. Here are several of the high-profile and high-value use cases for external data sources that are well-served by cloud-based data lakes.
General analytics. Analytics is the overarching use case for almost all data lakes. Analytics is also mostly about correlating disparate data points, as seen in advanced techniques such as data mining, clustering, and graph databases. A data lake can bring together massive quantities of data from many sources, structures, and latencies for the richest cross-source correlations possible. These rich correlations in turn lead to richer analytics, more comprehensive reporting, more complete views of customers, more discoveries of fraud, and more revelations about business characteristics. When the sources required for such an eclectic data mix are distributed across enterprise locations and Internet-based applications and parties, deploying a data lake in the cloud can be an advantage for creating a shared, multitenant intersection of extremely diverse data and its analytics.

Multichannel marketing data. TDWI sees this as the next-largest use case for data lakes after general analytics. The so-called marketing data lake is gaining adoption because it provides a single repository for the many data-driven functions of modern marketing, including data from numerous channels and customer touch points, 360-degree customer views, third-party data about consumers, campaign design and execution, and customer analytics (profiling, segmentation, profitability, etc.). Some customer channels and touch points are inherently Internet-based, especially those involving website visitor behavior and e-commerce. External data providers are likewise on the Internet, ranging from traditional consumer demographics to modern social media.

Furthermore, marketing departments tend to prefer SaaS-based applications for sales force automation and marketing campaign management, both of which involve larger regular data loads and extracts across the Internet than even most other SaaS applications. Additionally, marketing data solutions are not complete without functions for MDM and data quality. Finally, these data-driven, digital approaches to marketing are universally practiced by large firms, where marketers are geographically distributed but need to share large volumes of data easily. All these data types, sources, and scenarios point to the cloud-based data lake, augmented with modern DM and analytics tooling, possibly with an iPaaS platform.

Business-to-business (B2B) partner data. Procurement and supply chain operations are mission-critical in manufacturing and retail industries. Yet these industries have only recently begun serious modernization efforts as they move from faxes and phone calls to managed file transfers and automated transactions. Data lakes are easily optimized for file-based data and so are a good fit for landing, processing, archiving, and analyzing partner data exchanged via EDI, XML, and JSON files. Because these files arrive from and depart through the Internet, a cloud-based data lake can be proximity appropriate.

The Internet of Things. TDWI has a number of members who work in logistics firms, especially those for truck and rail freight. These firms have invested significantly in telematics and other sensor systems for their trucks, trailers, locomotives, and railcars. They've also deployed data gathering stations in their truck depots and rail yards. Some sensors stream data continuously (as with GPS) while others are intermittent (as when an RFID chip comes into proximity of an RFID reader). Every sensor and read context is different, meaning that data falls into many schemas, some standard and some proprietary. Sensor data may be transmitted over the Internet (often via WiFi), telephony, satellites, or enterprise networks.

When a data lake's DM solution supports many ingestion methods, it can handle incoming IoT data regardless of its latency, schema, or transmission infrastructure. The lake and its DM solution can also have data ready for immediate use, which is important because most IoT data has urgent operational functions (tracking shipments, spotting vehicle malfunctions, monitoring operator behavior, and operational reporting) as well as latent analytics uses (optimizing routes, vehicle maintenance, and customer service).
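As an illustration of handling many sensor schemas at once (the payload shapes and field names below are invented), a small normalizer can map a continuous GPS stream and an intermittent RFID read onto one canonical event for downstream tracking and analytics:

```python
def normalize(message):
    """Map differently shaped sensor payloads onto one canonical event."""
    if "lat" in message:                          # GPS-style schema
        return {"asset": message["vehicle_id"], "kind": "position",
                "value": (message["lat"], message["lon"]), "ts": message["ts"]}
    if "tag_id" in message:                       # RFID-style schema
        return {"asset": message["tag_id"], "kind": "checkpoint",
                "value": message["reader"], "ts": message["read_time"]}
    return {"kind": "unknown", "raw": message}    # keep raw form for later study

events = [
    {"vehicle_id": "truck-042", "lat": 47.48, "lon": -122.2, "ts": 1518000000},
    {"tag_id": "railcar-7", "reader": "yard-3-gate", "read_time": 1518000042},
]
print([normalize(m) for m in events])
```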
Deploying the data lake in the cloud makes sense when IoT data is shared with people outside the enterprise, usually through browser-based applications. For example, manufacturers of high-end farm machinery collect data from combines so the combine owner can analyze the vehicle's performance and his use of the machine. Some utility firms share data about electricity consumption so property owners can understand and optimize their consumption.

NUMBER SIX
USE THE DATA LAKE AS A SELF-SERVICE DATA EXPLORATION PLATFORM

As we just saw, the onslaught of new data from new sources is both a DM problem and a business opportunity. The cloud-based data lake is a likely solution for capturing new data. How does an organization get beyond data collection to business value?

Data exploration leads to business value and compliance.

Data exploration (sometimes called data discovery) is a recently emerged best practice that is critical to wringing business value from data that is new to an organization. Diverse users are now exploring data regularly, from technical people such as data analysts and scientists, to business domain experts such as marketers, accountants, and procurement specialists.

With new data, there is the prospect of new insight. A desirable outcome of data exploration is the discovery of facts about the business or one of its processes that were previously unknown. However, users also explore to simply understand the new data. Once they see which business domains, entities, and activities are represented in it, they can decide how this information can be applied to solutions in DM, operations, and analytics. This understanding also helps them decide which data governance, privacy, and security policies to apply to the new data.

Data profiling leads to DM solutions that deliver quality data.

The technical adjunct to exploration is profiling. Technical data management professionals (or special business users such as data stewards) profile new data to understand its technical state in terms of structure (format, model, or schema); metadata and other semantics (whether embedded, extracted, or deduced); condition (quality, standardization, completeness); and details of delivery or storage (files, streams, preferred interfaces, and so on). This information—called a data profile or data intelligence—is indispensable for getting started with new data as well as for improving, storing, and presenting it for fullest business use.
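A bare-bones profiler makes the idea tangible. This sketch (the function and sample rows are illustrative) computes an inferred type, completeness, and distinct-value count per column, which is the raw material of a data profile:

```python
def profile(rows):
    """Compute a simple per-column data profile: inferred type,
    completeness (non-null share), and distinct-value count."""
    stats = {}
    for col in rows[0]:
        values = [r[col] for r in rows]
        present = [v for v in values if v not in (None, "")]
        stats[col] = {
            "type": type(present[0]).__name__ if present else "unknown",
            "completeness": len(present) / len(values),
            "distinct": len(set(present)),
        }
    return stats

rows = [{"id": 1, "state": "WA"}, {"id": 2, "state": ""}, {"id": 3, "state": "wa"}]
print(profile(rows))
# 'state' profiles as type str, completeness 2/3, distinct 2; the "WA" vs.
# "wa" mismatch also hints that standardization is needed.
```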
A data lake can enable broad data exploration and profiling.

Data exploration and data profiling usually require detailed source data, and large quantities of it—features at which all data lakes excel. The multiple ingestion methods supported by most lakes also mean they excel at landing and staging all kinds of data; hence, a data lake is ideal for the first capture of new data and the first study of it.

New data aside, the average data lake contains highly diverse data, both modern and traditional, from both internal and external sources. The lake's eclectic mix of data makes it ideal for data exploration sessions in which the user develops analytic correlations among disparate data points in hopes of discovering connections. For example, correlations can reveal a person who is present at multiple accidents (suggesting fraud), seasonal conditions that lead to damages (enlightening actuaries), or bottleneck steps in a process (suggesting efficiency improvements).

Data exploration, data prep, and other data lake-based self-service practices are coalescing into a unified analytics approach demanded by both business and technology users.

As mentioned, self-service business practices are on the rise among business domain experts, and they expect them with their data lakes. These users need a combined data lake-and-DM solution that provides well-developed business-friendly semantics. They also need end-user tools with strong ease of use. Ignore these requirements at your peril because without them, most users will not use a data lake and it will be considered a failure.

Note that self-service data exploration often leads directly into other self-service practices and tool types. For example, once a user has explored individual records of customers who churned recently, he or she might immediately move to a self-service data prep tool (which ideally is in the same toolset as the exploration tool) to pull all similar records into a data set, then augment the data set with demographics of these and similar customers. The user then applies self-service data visualization to the data set to get a different or deeper understanding of the current form of churn. Once the data set and its visualization are polished, the user may publish them via self-service so other users can subscribe to them and understand the latest form of customer churn.

A data lake and its tooling can enable these self-service tasks. The lake provides the data and the semantics appropriate to exploration. The lake and its DM tools enable the queries that are at the heart of most exploration and data prep. The lake can store self-service data sets, preferably in special areas such as sandboxes, so that various users and tools can subscribe to sandbox content. To avoid swamp practices such as abandoned data sets, the lake's tools should automatically delete sandboxes that lie unused for long periods of time.
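Automated sandbox expiry can be as simple as a scheduled sweep. This sketch is illustrative: the root path and the 90-day retention window are assumptions, and a production lake tool would archive or notify owners rather than delete outright.

```python
import os
import shutil
import time

MAX_IDLE_DAYS = 90   # illustrative retention policy

def sweep_sandboxes(root="lake/sandboxes"):
    """Delete sandbox directories whose files haven't been touched
    within the retention window, so abandoned data sets don't pile up."""
    cutoff = time.time() - MAX_IDLE_DAYS * 86400
    for name in os.listdir(root):
        path = os.path.join(root, name)
        if not os.path.isdir(path):
            continue
        last_used = max((os.path.getmtime(os.path.join(dirpath, f))
                         for dirpath, _, files in os.walk(path)
                         for f in files), default=os.path.getmtime(path))
        if last_used < cutoff:
            shutil.rmtree(path)   # in practice: archive or notify owners first
```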

ABOUT OUR SPONSOR

informatica.com

Informatica is the only enterprise cloud data management leader that accelerates data-driven digital transformation. Informatica enables companies to unleash the power of data to fuel innovation, become more agile, and realize new growth opportunities, resulting in intelligent market disruptions. With over 7,000 customers worldwide, Informatica is the trusted leader in enterprise cloud data management.

Businesses must transform to stay relevant and data holds the answers. We're prepared to help you intelligently lead—in any sector, category, or niche. Because our focus is 100% on everything data, we offer the versatility needed to succeed. We invite you to explore all that Informatica has to offer—and unleash the power of data to drive your next intelligent disruption.

With full access to data, IT roles shift dramatically to become more strategic, more essential, to become partners in leading the business. With Informatica, you can use data to develop new business models and capture growth opportunities. We enable you to unleash the power of data to intelligently disrupt your industry.

The Informatica Intelligent Data Platform is the industry's most complete and modular solution, built on a microservices architecture, to help companies unleash the power and value of all data across the hybrid enterprise. The AI-driven platform spans on-premises, cloud and big data anywhere—ensuring data is trusted, secure, governed, accessible, timely, relevant and actionable. This enables the world's most progressive companies to deliver data-driven digital transformation outcomes.

ABOUT THE AUTHOR

Philip Russom, Ph.D., is senior director of TDWI Research for data management and is a well-known figure in data warehousing, integration, and quality, having published over 550 research reports, magazine articles, opinion columns, and speeches over a 20-year period. Before joining TDWI in 2005, Russom was an industry analyst covering data management at Forrester Research and Giga Information Group. He also ran his own business as an independent industry analyst and consultant, was a contributing editor with leading IT magazines, and was a product manager at database vendors. His Ph.D. is from Yale. You can reach him at prussom@tdwi.org, @prussom on Twitter, and on LinkedIn at linkedin.com/in/philiprussom.

ABOUT TDWI RESEARCH

TDWI Research provides research and advice for BI professionals worldwide. TDWI Research focuses exclusively on analytics and data management issues and teams up with industry practitioners to deliver both broad and deep understanding of the business and technical issues surrounding the deployment of business intelligence and data management solutions. TDWI Research offers reports, commentary, and inquiry services via a worldwide membership program and provides custom research, benchmarking, and strategic planning services to user and vendor organizations.

ABOUT TDWI CHECKLIST REPORTS

TDWI Checklist Reports provide an overview of success factors for a specific project in business intelligence, data warehousing, or a related data management discipline. Companies may use this overview to get organized before beginning a project or to identify goals and areas of improvement for current projects.

