FEBRUARY 2018

NUMBER ONE
Catalog Data to Keep the Lake from Becoming a Swamp
By Philip Russom

NUMBER TWO
Address the Data Lake's Aggressive Data Ingestion Methods with DM Practice Adjustments and Data Integration Tools

NUMBER THREE
Design a Cloud-Based or Hybrid Architecture for Your Data Lake

NUMBER FOUR
Consider iPaaS as Tooling for Cloud DM

NUMBER FIVE
Make Your Cloud-Based Data Lake a Nexus for Sharing Modern and Traditional Data, with a Focus on External Sources

NUMBER SIX
Use the Data Lake as a Self-Service Data Exploration Platform
T 425.277.9126
F 425.687.2842
E info@tdwi.org
tdwi.org

© 2018 by TDWI, a division of 1105 Media, Inc. All rights reserved. Reproductions in whole or in part are prohibited except by written permission. Email requests or feedback to info@tdwi.org.

Product and company names mentioned herein may be trademarks and/or registered trademarks of their respective companies. Inclusion of a vendor, product, or service in TDWI research does not constitute an endorsement by TDWI or its management. Sponsorship of a publication should not be construed as an endorsement of the sponsor organization or validation of its claims.
TDWI CHECKLIST REPORT: DATA MANAGEMENT FOR DATA LAKES IN THE CLOUD
Business and Technology Drivers for Cloud-Based Data Lakes

Data lake disruption. Many organizations are facing a flood of new data types and sources coming from big data, customer channels, social media, the Internet of Things (IoT), and numerous external sources (such as partners and third-party data providers). They know they need to disrupt "business as usual" because older DM best practices—and the ways a business gets value from data—don't necessarily manage new data assets appropriately or generate business value. Organizations need disruptive database designs (such as data lakes) and modern computing platforms (such as clouds) that are optimized for the aggressive ingestion and agile use of new data assets.

Leapfrogging standard data architectures. A lake or cloud can breathe new life into established enterprise data architectures (data warehouses, marketing channel data, digital supply chains) or create new and different ones (analytics labs and sandboxes, ecosystems of cloud-based operational applications). This way, traditional approaches continue to deliver value while modern approaches pioneer new practices and value and enable the future.

Data-driven business innovation. Nowadays, digital enterprises serve and manage customers through data; imagine and design new products through data; discover and deploy new

• Data lakes tend to have complex multiplatform data architectures (MDAs) and they exchange data with other MDAs. DM must stitch together these complex hybrid environments.

• DM tools used in hybrid environments should support traditional best practices (integration, quality, master data management) as well as new ones (microservices, orchestration) via an integrated tool platform. One new tool-based solution for this problem is iPaaS.

• For the richest analytic correlations, a data lake should integrate traditional and modern data—both structured and unstructured—from many sources at multiple latencies. The explosion of data from external sources is both a problem and an opportunity.

• Data lakes mostly manage extracted or streamed source data in its original raw form so it can be repurposed repeatedly for multiple use cases. Yet you must also present this data in forms that are conducive to data exploration and other self-service practices for a wide range of user types.

This report will now discuss these challenges and offer practical solutions.
Next-generation metadata management. Because most tool types and user query practices require metadata, tools should be in place to develop, capture, and automatically deduce technical metadata. These scenarios involve traditional development as well as new situations.

For example, as users explore a lake's data (a common business requirement for a lake), the tool should help them develop metadata as they go. For metadata-free source data, a tool should parse the incoming file, data set, or message to deduce its schema and turn that into reusable technical metadata. This valuable automation may be enabled by a rules engine or by an artificial intelligence and machine learning (AI/ML) algorithm, and it may be curated by a user or applied autonomously without human intervention.

Business metadata and business glossaries. The number of business users and other nontechnical users continues to increase. They demand easy-to-use tools and data semantics that will enable them to access and use a lake's data autonomously. At a minimum, they need business metadata, which translates technical metadata into business-friendly descriptions of data that they can understand. On a more sophisticated level, this class of user is progressively demanding a business glossary with which they can work collaboratively with others to create, define, and apply terms describing common business entities, such as customer, profitability, and production yield.

4. Crowd-source the development of data intelligence. Users don't just access a data catalog; they help to develop it. Users can enter data elements into the catalog; review and rank elements by quality, trust, and usability; and curate entries in the catalog for accuracy or proper placement in a taxonomy or hierarchy.1

General semantics requirements. Whether metadata, glossary, or catalog, semantics should be centralized for a single view that is accessible by broadly deployed users and technologies. Recognize that data lake users will demand self-service data practices (data exploration, preparation, visualization) and these are not possible without friendly and broad semantics. For all forms of semantics (and many other DM tasks), you should demand tools that support productive automation through a rules or an AI/ML engine.
1 For a detailed discussion of modern data cataloging, read the TDWI Checklist Report The Data Catalog's Role in the Digital Enterprise, online at tdwi.org/checklists.
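The schema-deduction automation described above can be sketched in Python. Everything here is illustrative: the simple type rules stand in for what a commercial rules engine or AI/ML algorithm would do far more robustly, and the function and field names are invented for the example.

```python
import json
from datetime import datetime

def deduce_type(value: str) -> str:
    """A minimal rules engine: classify one raw string value."""
    for cast, name in ((int, "integer"), (float, "float")):
        try:
            cast(value)
            return name
        except ValueError:
            pass
    try:
        datetime.fromisoformat(value)  # ISO-8601 timestamps
        return "timestamp"
    except ValueError:
        return "string"

def deduce_schema(rows):
    """Scan sample rows of a metadata-free data set and emit reusable
    technical metadata: one deduced type per column."""
    schema = {}
    for row in rows:
        for col, val in row.items():
            t = deduce_type(str(val))
            # On conflicting evidence across rows, widen to "string".
            if schema.setdefault(col, t) != t:
                schema[col] = "string"
    return schema

sample = [
    {"id": "101", "reading": "98.6", "taken_at": "2018-02-01T08:30:00"},
    {"id": "102", "reading": "97.1", "taken_at": "2018-02-01T09:00:00"},
]
print(json.dumps(deduce_schema(sample)))
# {"id": "integer", "reading": "float", "taken_at": "timestamp"}
```

In a real tool, the deduced schema would then be stored as technical metadata, presented to a curator for review, or applied autonomously, as the report notes.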
TDWI RESEARCH tdwi.org
Architecture and governance teams. Tools aside, data architects are indispensable to the success of an MDA, including those responsible for data lakes and clouds. An enterprise data architect (or data warehouse architect, etc.) typically heads a team of people who influence the selection of data platforms and DM tools relative to MDAs. The team also fosters enterprise data standards that facilitate the integration of data across platforms, whether on premises or in the cloud. A data governance or stewardship program may provide similar assistance with data standards and architectural preferences.

NUMBER FOUR
CONSIDER IPAAS AS TOOLING FOR CLOUD DM

Integration Platform-as-a-Service (iPaaS) is a response to new business and technical requirements.

Many user organizations have compelling reasons or executive mandates to move to the cloud. They seek to follow cloud-first policies, control IT costs, integrate disparate applications, deliver data-driven solutions faster, and provide integration infrastructure for complex multiplatform data environments that include clouds. Organizations facing any combination of these evolving business and technology requirements are realizing that their traditional on-premises integration solutions are not a good fit for fast-paced cloud operations or complex hybrid environments. Furthermore, users migrating data and applications to the cloud need richer cloud-based integration toolsets both to facilitate the migration and to support daily cloud-native integration flows. As a result, Integration Platform-as-a-Service (iPaaS) has emerged to address today's requirements. IPaaS is a suite of cloud microservices enabling the development, execution, and governance of integration flows and data pipelines. It also enables modern, cloud-based DM, which can ably apply to cloud-based data lakes.

DM requirements for iPaaS

In a nutshell, here are the essential requirements for iPaaS, plus some of its uses and benefits:

Powered by the cloud. IPaaS is cloud based to take advantage of cloud elasticity, scalability, flexibility, and low cost. This also allows it to interface directly with Internet-based applications (Salesforce, Marketo, NetSuite) and data sources (Web apps, B2B partners, IoT).

Fast track to integration solutions. IPaaS can accelerate application and data integration. For example, when a public cloud has multiple tools already set up and optimized, it prevents users from burning up valuable time and personnel on system integration. This way, cloud-based systems present minimal time until business use. In many cases, users can set up interfaces, load data, migrate users, and put the solution into production in a few days.

All forms of integration. Vendor products for iPaaS vary, but the comprehensive ones support the many functions of both data integration and application integration. This is because, in addition to data integration, many users need a data-driven toolset for reliably migrating, consolidating, and delivering application data, plus managing data from SaaS apps.

Multitool suite. IPaaS suites tend to include many integration tool types through a single development and management console, including those for data integration, quality, and master data management (MDM); plus application integration, orchestration, and process management. The unified toolset boosts developer productivity, the creation of consistent standards, and broad curation. Furthermore, it fosters the modern design of flows and pipelines that incorporate multiple forms of integration technology.

Microservices. IPaaS functionality is available via cloud-based microservices. Almost any data- or application-integration function you can think of is now a service.

API driven. To fulfill its aggressive integration goals, an iPaaS suite must support all modern and traditional application programming interfaces (APIs), in addition to including special functionality for managing API portfolios and performance.

Built for hybrid environments. Despite the focus on the cloud, iPaaS also provides integration microservices that can be used by on-premises applications and tools. In fact, organizations with iPaaS typically use it as a nexus that provides rich integration and interoperability for the many platforms of a hybrid data and application environment. Given that many data lakes are cloud based and hybrid, iPaaS can be an appropriate DM solution for them.4

4 For more information, read the TDWI Checklist Report Data Management Best Practices for Cloud and Hybrid Architectures, online at tdwi.org/checklists. Also, replay the archived TDWI webinar Achieving Integration Agility, Scale, and Simplicity via Cloud-Based Integration Platform-as-a-Service, available online at tdwi.org/webinars.

NUMBER FIVE
MAKE YOUR CLOUD-BASED DATA LAKE A NEXUS FOR SHARING MODERN AND TRADITIONAL DATA, WITH A FOCUS ON EXTERNAL SOURCES

One of the fastest growing trends in information technology today concerns the burgeoning number of parties and applications that are external to an enterprise. Most of these generate data and may demand data in return, and that data is critical to business success. Therefore, data management professionals are under pressure to
capture and manage data coming from sources and technologies that are new to them. They must also support new business use cases by processing data in ways that are likewise new. Here are several of the high-profile and high-value use cases for external data sources that are well-served by cloud-based data lakes.

General analytics. Analytics is the overarching use case for almost all data lakes. Analytics is also mostly about correlating disparate data points, as seen in advanced techniques such as data mining, clustering, and graph databases. A data lake can bring together massive quantities of data from many sources, structures, and latencies for the richest cross-source correlations possible. These rich correlations in turn lead to richer analytics, more comprehensive reporting, more complete views of customers, more discoveries of fraud, and more revelations about business characteristics. When the sources required for such an eclectic data mix are distributed across enterprise locations and Internet-based applications and parties, deploying a data lake in the cloud can be an advantage for creating a shared, multitenant intersection of extremely diverse data and its analytics.

Multichannel marketing data. TDWI sees this as the next-largest use case for data lakes after general analytics. The so-called marketing data lake is gaining adoption because it provides a single repository for the many data-driven functions of modern marketing, including data from numerous channels and customer touch points, 360-degree customer views, third-party data about consumers, campaign design and execution, and customer analytics (profiling, segmentation, profitability, etc.). Some customer channels and touch points are inherently Internet-based, especially those involving website visitor behavior and e-commerce. External data providers are likewise on the Internet, ranging from traditional consumer demographics to modern social media.

Furthermore, marketing departments tend to prefer SaaS-based applications for sales force automation and marketing campaign management, both of which involve larger regular data loads and extracts across the Internet than even most other SaaS applications. Additionally, marketing data solutions are not complete without functions for MDM and data quality. Finally, these data-driven, digital approaches to marketing are universally practiced by large firms, where marketers are geographically distributed but need to share large volumes of data easily. All these data types, sources, and scenarios point to the cloud-based data lake, augmented with modern DM and analytics tooling, possibly with an iPaaS platform.

Business-to-business (B2B) partner data. Procurement and supply chain operations are mission-critical in manufacturing and retail industries. Yet these industries have only recently begun serious modernization efforts as they move from faxes and phone calls to managed file transfers and automated transactions. Data lakes are easily optimized for file-based data and so are a good fit for landing, processing, archiving, and analyzing partner data exchanged via EDI, XML, and JSON files. Because these files arrive from and depart through the Internet, a cloud-based data lake can be proximity appropriate.

The Internet of Things. TDWI has a number of members who work in logistics firms, especially those for truck and rail freight. These firms have invested significantly in telematics and other sensor systems for their trucks, trailers, locomotives, and railcars. They've also deployed data gathering stations in their truck depots and rail yards. Some sensors stream data continuously (as with GPS) while others are intermittent (as when an RFID chip comes into proximity of an RFID reader). Every sensor and read context is different, meaning that data falls into many schemas, some standard and some proprietary. Sensor data may be transmitted over the Internet (often via WiFi), telephony, satellites, or enterprise networks.

When a data lake's DM solution supports many ingestion methods, it can handle incoming IoT data regardless of its latency, schema, or transmission infrastructure. The lake and its DM solution can also have data ready for immediate use, which is important because most IoT data has urgent operational functions (tracking shipments, spotting vehicle malfunctions, monitoring operator behavior, and operational reporting) as well as latent analytics uses (optimizing routes, vehicle maintenance, and customer service).

Deploying the data lake in the cloud makes sense when IoT data is shared with people outside the enterprise, usually through browser-based applications. For example, manufacturers of high-end farm machinery collect data from combines so the combine owner can analyze the vehicle's performance and their use of the machine. Some utility firms share data about electricity consumption so property owners can understand and optimize their consumption.

NUMBER SIX
USE THE DATA LAKE AS A SELF-SERVICE DATA EXPLORATION PLATFORM

As we just saw, the onslaught of new data from new sources is both a DM problem and a business opportunity. The cloud-based data lake is a likely solution for capturing new data. How does an organization get beyond data collection to business value?

Data exploration leads to business value and compliance.

Data exploration (sometimes called data discovery) is a recently emerged best practice that is critical to wringing business value from data that is new to an organization. Diverse users are now exploring data regularly, from technical people such as data analysts and scientists, to business domain experts such as marketers, accountants, and procurement specialists.
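As a recap of the IoT discussion above, the schema-on-read landing behavior it describes can be sketched in Python. All payload formats, field names, and transport labels here are invented for illustration; a real lake's ingestion layer would be far richer.

```python
import json
from datetime import datetime, timezone

def land(raw, transport):
    """Land one sensor message in the lake in its original raw form,
    wrapped in a minimal envelope; schema is deduced opportunistically
    rather than enforced at ingestion time (schema-on-read)."""
    envelope = {
        "received_at": datetime.now(timezone.utc).isoformat(),
        "transport": transport,  # e.g., wifi, telephony, satellite
        "raw": raw,              # untouched original payload
    }
    try:
        payload = json.loads(raw)
        # Standard JSON payload: capture light technical metadata now.
        envelope["deduced_fields"] = sorted(payload) if isinstance(payload, dict) else []
    except json.JSONDecodeError:
        # Proprietary format: keep the raw bytes; deduce schema later.
        envelope["deduced_fields"] = []
    return envelope

gps = land('{"lat": 47.6, "lon": -122.3, "unit": "truck-17"}', "wifi")
rfid = land("RFID|0xA41F|yard-3", "enterprise-network")
print(gps["deduced_fields"], rfid["deduced_fields"])
# ['lat', 'lon', 'unit'] []
```

Because the raw payload is preserved untouched, the same message can serve the urgent operational uses and the latent analytics uses the report lists, without forcing a single schema on every sensor type.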
With new data, there is the prospect of new insight. A desirable outcome of data exploration is the discovery of facts about the business or one of its processes that were previously unknown. However, users also explore to simply understand the new data. Once they see which business domains, entities, and activities are represented in it, they can decide how this information can be applied to solutions in DM, operations, and analytics. This understanding also helps them decide which data governance, privacy, and security policies to apply to the new data.

Data profiling leads to DM solutions that deliver quality data.

The technical adjunct to exploration is profiling. Technical data management professionals (or special business users such as data stewards) profile new data to understand its technical state in terms of structure (format, model, or schema); metadata and other semantics (whether embedded, extracted, or deduced); condition (quality, standardization, completeness); and details of delivery or storage (files, streams, preferred interfaces, and so on). This information—called a data profile or data intelligence—is indispensable for getting started with new data as well as for improving, storing, and presenting it for fullest business use.

A data lake can enable broad data exploration and profiling.

Data exploration and data profiling usually require detailed source data, and large quantities of it—features at which all data lakes excel. The multiple ingestion methods supported by most lakes also mean they excel at landing and staging all kinds of data; hence, a data lake is ideal for the first capture of new data and the first study of it.

New data aside, the average data lake contains highly diverse data, both modern and traditional, from both internal and external sources. The lake's eclectic mix of data makes it ideal for data exploration sessions in which the user develops analytic correlations among disparate data points in hopes of discovering connections. For example, correlations can reveal a person who is present at multiple accidents (suggesting fraud), seasonal conditions that lead to damages (enlightening actuaries), or bottleneck steps in a process (suggesting efficiency improvements).

Ignore users' demands for self-service at your peril because without them, most users will not use a data lake and it will be considered a failure.

Note that self-service data exploration often leads directly into other self-service practices and tool types. For example, once a user has explored individual records of customers who churned recently, he or she might immediately move to a self-service data prep tool (which ideally is in the same toolset as the exploration tool) to pull all similar records into a data set, then augment the data set with demographics of these and similar customers. The user then applies self-service data visualization to the data set to get a different or deeper understanding of the current form of churn. Once the data set and its visualization are polished, the user may publish them via self-service so other users can subscribe to them and understand the latest form of customer churn.

A data lake and its tooling can enable these self-service tasks. The lake provides the data and the semantics appropriate to exploration. The lake and its DM tools enable the queries that are at the heart of most exploration and data prep. The lake can store self-service data sets, preferably in special areas such as sandboxes, so that various users and tools can subscribe to sandbox content. To avoid swamp practices such as abandoned data sets, the lake's tools should automatically delete sandboxes that lie unused for long periods of time.
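The automatic sandbox cleanup mentioned above might be sketched like this in Python. The registry shape and the 90-day retention window are assumptions for illustration, not a prescription from the report.

```python
from datetime import datetime, timedelta

def expired_sandboxes(last_access, now, retention=timedelta(days=90)):
    """Return sandboxes that have lain unused longer than the retention
    window, so the lake's tooling can delete them before they become
    abandoned swamp content. last_access maps sandbox name to the last
    time any user or tool touched its data sets."""
    return sorted(name for name, seen in last_access.items()
                  if now - seen > retention)

registry = {
    "churn-explore": datetime(2017, 9, 1),   # abandoned sandbox
    "campaign-prep": datetime(2018, 1, 20),  # recently used
}
print(expired_sandboxes(registry, datetime(2018, 2, 1)))
# ['churn-explore']
```

A production policy would likely notify owners and archive before deleting, but the core check is just this comparison of last-use timestamps against a retention window.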