
DATA LAKES: BEYOND THE HYPE AND READY FOR THE ENTERPRISE

Best Practices Series

Hortonworks / Informatica / Teradata: Designing the Data Lake for Faster Time to Value
Progress DataDirect: Loading Your SaaS Data Into a Big Data Lake


CALMING THE TURBULENT WATERS OF DATA LAKES

Best Practices Series

Data lakes have become a mainstream strategy for many enterprises over the past couple of years, with promises of greater flexibility in the way data is handled and made available to decision makers. A recent survey by Unisphere Research, a division of Information Today, Inc., found that 20% of data managers and professionals are currently deploying data lakes, and 45% are learning about and researching them. A majority, 56%, have a positive impression of the concept in that it may serve some value to their businesses. At least 38% indicate their companies are committed to data lake strategies ("Data Lake Adoption and Maturity Survey Findings Report," Unisphere Research, October 2015).

In most cases, data lakes are defined as data environments that capture and store raw data. A data lake comprises data in its original format, to be transferred and transformed at a later date as applications and end users demand. The thinking behind the concept is that the analytics or questions to be applied against the data may not yet have been identified, and by holding the data in a relatively accessible environment, it is open for future innovation.

However, as with any major enterprise data initiative, the concept has to be sold to the enterprise. Data lakes absorb data from a variety of sources and store it all in one place, with all the necessary requirements for integration and security. Data lakes are a response to the eternal problem of data silos, attempting to bypass these various, fragmented environments to finally maintain data all in one place. The data lake also reduces the requirement for immediately processing or integrating the wide variety of data formats that comprise big data.

Here are some best practices for successful implementation of a data lake:

THINK LONG TERM
The advantage that data lakes offer is the ability to keep data at the ready for applications or queries that have yet to be designed. The bottom line is that no one knows what data will be valuable in 5 years, or how it might be used. There may be entirely new business lines built around data that is currently being cast aside in today's environments. As a result, data needs to be maintained and stored for purposes and applications that have yet to be determined. Some data will be needed immediately, while other data will need to be stored. To see this through, enterprise input is essential to identify those information areas that decision makers see as having future potential.
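A minimal sketch of this "store now, decide later" practice in PySpark (the bucket names, paths, and clickstream dataset are hypothetical; the same pattern applies to HDFS or any other low-cost store):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("raw-landing").getOrCreate()

# Land the data exactly as it arrived: no parsing, no schema, no filtering.
# Reading it as plain text preserves every record byte-for-byte, and the
# date partition in the output path keeps each day's load replayable.
raw = spark.read.text("s3a://example-landing/clickstream/2016-04-01/*.json")
raw.write.mode("append").text("s3a://example-lake/raw/clickstream/dt=2016-04-01")

# Years later, a use case that didn't exist at load time can still impose
# structure on the same untouched bytes.
events = spark.read.json("s3a://example-lake/raw/clickstream/dt=2016-04-01")
events.groupBy("user_id").count().show()
```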
VIEW DATA LAKES AS A BUSINESS GROWTH OPPORTUNITY VERSUS A COST-SAVING MEASURE
As with any major IT initiative, cost savings drive many data lake initiatives. However, the value will ultimately be realized in the potential avenues the data lake offers for business growth. The next frontier for data lakes is providing organizations with greatly enhanced analytical opportunities. The analytics made possible by large stores of information in data lakes facilitate customer relationship management, predictive analytics, preventive maintenance, fraud detection, and a range of other applications.

IMPLEMENT SOUND DATA GOVERNANCE
Data governance is just as important to data lakes as it has been to data warehousing and other big data projects in recent years. The data lake is increasingly recognized as both a viable and compelling component within a data strategy, with companies large and small continuing to move toward adoption. Governance is the key challenge to data lakes, cited by 71% of respondents in the Unisphere Research survey. When data lakes first appeared on the scene, they were catchalls for structured and unstructured data, even though the data might not have been useful at the moment. As applications and functions are developed, custom adjustments or recoding are performed to meet the requirements of the situation. However, organizations may not have the data talent to address these requirements on an ever-increasing scale. Data lakes are, from the start, an enterprise project that requires enterprise input and ownership.

PUT SEMANTICS IN PLACE
The best way to avoid having data lakes decay into haphazard "data swamps" is to have a strong semantic structure that provides consistency and enables the easy discovery and access of essential data. It's also important to map relationships between data, which provides the foundation for trusted data that is used within analytic applications. Semantic models also need to be clear to human operators, as well as map to the needs of the business.
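As an illustration, here is what a minimal semantic catalog record might look like, sketched in plain Python; the field names and the lookup helper are invented for the example rather than drawn from any particular product:

```python
# An illustrative catalog record: enough semantics to make a raw dataset
# discoverable and to document how it relates to other data in the lake.
clickstream_entry = {
    "name": "clickstream_raw",
    "location": "s3a://example-lake/raw/clickstream/",
    "format": "json",
    "owner": "web-analytics-team",
    "description": "Unprocessed click events from the public website",
    "business_terms": ["customer journey", "web traffic"],
    "relationships": [
        # Mapped relationships are what make joins trustworthy downstream.
        {"field": "user_id", "refers_to": "crm.customers.customer_id"},
    ],
    "governance": {"vetted": False, "contains_pii": True},
}

def find_datasets(catalog, term):
    """Discovery: look up datasets by the business terms analysts actually use."""
    return [entry["name"] for entry in catalog if term in entry["business_terms"]]

print(find_datasets([clickstream_entry], "web traffic"))  # ['clickstream_raw']
```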
DATA LAKES SHOULD AUGMENT—NOT REPLACE—CURRENT TECHNOLOGY
There remains a great deal of confusion in the market, with many managers and professionals assuming that the addition of a data lake reduces the need for an enterprise data warehouse. Nothing could be further from the truth. While there is uncertainty—and perhaps some purposeful ambiguity—in the market about what data lakes are and what role they serve, there are clear capabilities for data lakes that differ from those of data warehouses. For example, while data lakes are intended for the storage of raw data, making it easy to retrieve, data warehouses, by contrast, are intended to serve as sources of extracted and vetted data for specific areas of the business. Importantly, however, data lakes will typically need to sit in front of data warehouses or other Hadoop deployments, with data that may be processed within those environments at a later date.

ADDRESS DATA SECURITY AND COMPLIANCE ISSUES
Data lakes help reduce the need to move data through the organization, which is a costly and complex undertaking, not to mention the security implications that go with it. At least 67% of respondents to the Unisphere Research survey indicated they are concerned about data security in these settings. There is considerable concern that data lakes will serve as repositories for all types of unvetted and insecure data or even files and documents that may be in violation of compliance mandates—again, the "data swamp" fears. Data lakes need to have the same security, governance, and accountability as any other data environment within the enterprise. Authentication, authorization, and encryption are key to managing a secure and compliant data lake environment.

AUTOMATE AS MUCH AS POSSIBLE
Data lakes will have a lot of moving parts, handling ingestion, integration, processing, and storage, among other important functions. This all needs to be built and embedded into the enterprise data framework to the point where it's virtually invisible to end users. The more these various functions are automated, the more data managers and business users can focus on their higher-level core competencies. The various functions of data lakes—from ingestion to storage—can all be automated. Automation also makes real-time data movement and analytics possible.

—Joe McKendrick



sponsored content

Designing the Data Lake for Faster Time to Value
THE DATA LAKE OPPORTUNITY
Data has become the lifeblood of nearly every industry-leading company. But the ability to turn this data into valuable business insights, with the right data delivered at the right time, is what separates industry leaders from laggards. The concept of the data lake pattern has developed as a means to economically harness and derive value from exploding data volume and variety. New data sources such as web, mobile, and connected devices, along with new forms of analytics such as text, graph, and pathing, have necessitated a new data lake design pattern to augment traditional design patterns such as the data warehouse.

Enterprise data warehouses have always struggled to balance time to delivery against auditability, stability, ease of use, data quality, and performance. The new challenge in a world with growing volume, variety, and velocity of data is that business analysts must manually find and reconcile fragmented, replicated, incomplete, and inconsistent data across the organization. As a result, business analysts face delays in accessing needed data and sharing it with one another in a timely manner. Moreover, with the exponential increase in data volume and data proliferation across systems, business analysts run the risk of delivering inaccurate reports and predictions because of data that is insufficient, incomplete, inconsistent, inaccurate, or insecure.

FIT-FOR-PURPOSE APPROACH
Prior to the advent of the trend toward big data, there was a singular and traditional approach to data management whereby data was fully normalized and persisted in a relational data warehouse before any value could be derived. The theory behind this tightly coupled schema-on-write philosophy was that it was essential to absorb the costs of data management up front so that all available data was of high fidelity. The requirement for high-fidelity data for high-performance analytics continues to grow, but today the relational data warehouse approach is no longer sufficient to manage all enterprise data.

Meanwhile, infrequently used data, or so-called "dark data," has historically just been deleted as part of information lifecycle policies, usually because it is unstructured and arrives in large quantities. Since no one explores it, there is no demand for that data from the user community. Based upon the assumption of infrequent use, the data is not examined for the possibility of gold nuggets of value. Absent this discovery or exploration, the data naturally goes unused. This creates a chicken-and-egg puzzle whereby historical access patterns bias future availability.

This brings up the central challenge of how to deal with data of unknown or questionable value from a burgeoning number of data sources and types. Moving it into the data warehouse or database management system isn't ideal for two reasons.

One is that the schema has to be predefined, specifically as a "compromise," before data can be loaded into either of these environments. This isn't a trivial effort, and it often limits organizations to only using data they know to have value beforehand. Also, the act of creating the schema is likely to constrain data to a few anticipated use cases. Much of the predictive and diagnostic value contained in the raw data may then be lost.
And, two, the storage cost model for the data warehouse or DBMS may not lend itself to wholesale data ingestion. This leaves organizations searching for a place where they can explore the data for alternative interpretations to the compromise, then discover the value of those alternatives.

Newer big data technologies and approaches have enabled a new fit-for-purpose approach for delivering the right data to the right people at the right time. These newer, loosely coupled schema-on-read philosophies are enabling organizations to leverage a "data lake pattern" that collects all of their raw data into persistence systems prior to any modeling or transformation.
DATA SECURITY & GOVERNANCE
Still, data lake patterns are not without their challenges. With the growing interest in self-service analytics tools, organizations are seeing data proliferate through teams with increasing risk of security and governance failures. In the face of such risks, some organizations are locking down access to self-service tools and Hadoop clusters, thus defeating the purpose of leveraging data lake patterns. However, organizations need not make a false tradeoff between data anarchy and data tyranny.

Through collaborative and machine-learning-guided curation processes, datasets can be refined from raw form into datasets with increasing levels of quality. This enables a bare minimum of data rationalization to occur post-ingestion for data consumers who don't require the highest fidelity of data. Meanwhile, new curation processes discovered by data consumers can also be operationalized into automated and repeatable ETL-like processes. By enabling secured and governed access to data by data consumers themselves, organizations can achieve a balanced and well-governed data democracy.

"The purpose of a data lake is to present an unrefined view of data to only the most highly skilled analysts," notes Gartner (2015), "to help them explore their data refinement and analysis techniques independent of any of the system of record compromises that may exist in a traditional analytic data store (such as a data warehouse)."

When data finally exits the curation process inside the data lake environment, it can be loaded into a downstream integrated data warehouse (IDW). The IDW is optimized for storing the highest-fidelity data and performing high-performance analytical queries. A fully leveraged IDW will be deployed with access to its structured data products from across multiple lines of business within the enterprise. This promotes reuse of high-fidelity data through a service-oriented model, while dataset curation happens quickly and confidently in the data lake for maximum business value.
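A compressed sketch of that flow in PySpark. The cleansing rules, paths, and warehouse connection details are all placeholders, and the final step uses Spark's generic JDBC writer rather than any vendor-specific loader:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("curate-and-promote").getOrCreate()

def curate_clickstream(raw_path):
    """A curation recipe discovered interactively by a data consumer, captured
    as a repeatable, ETL-like job: dedupe, drop malformed rows, fix types."""
    df = spark.read.json(raw_path)
    return (df.dropDuplicates(["event_id"])
              .filter(F.col("user_id").isNotNull())
              .withColumn("event_time", F.col("event_time").cast("timestamp")))

curated = curate_clickstream("s3a://example-lake/raw/clickstream/")

# Keep the refined copy in the lake for other consumers...
curated.write.mode("overwrite").parquet("s3a://example-lake/curated/clickstream/")

# ...and promote it to the downstream integrated data warehouse (IDW).
curated.write.jdbc(
    url="jdbc:teradata://dw.example.com/DATABASE=analytics",  # placeholder URL
    table="clickstream_events",
    mode="append",
    properties={"user": "etl_user", "password": "***",
                "driver": "com.teradata.jdbc.TeraDriver"},
)
```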
DEFINING THE DATA LAKE
Confusion regarding the definition of a data lake abounds in the absence of a large body of well-understood best practices. Drawing upon many sources, as well as on-site experience with leading data-driven customers, we define the data lake as a collection of long-term data containers that capture, refine, and explore any form of raw data at scale, enabled by low-cost technologies, from which multiple downstream facilities may draw.

It is important to note that a design pattern is an architecture and set of corresponding requirements that have evolved to the point where there is agreement on best practices for implementations. How you implement it varies from workload to workload and organization to organization. While technologies are critical to the outcome, a successful data lake needs a plan. A data lake design pattern is that plan.

The data lake definition does not prescribe a technology, only the requirements. While data lakes are typically discussed synonymously with Hadoop, which is an excellent choice for many data lake workloads, a data lake can be built on multiple technologies such as Hadoop, NoSQL, S3, RDBMS, or combinations thereof.

One way to look at it is that Hadoop is a technology, while the data lake is a design pattern fit for certain workloads. Hadoop does not perform all of those workloads. You have to go get ETL, data wrangling, and data governance tools, as well as many others, in order to make the data lake the best it can be. In contrast, the data lake is not a real-time complex event processing tool. In fact, it's not a tool at all. Within the data lake definition and design, we do not find things such as HBase or real-time application development. Nor do we expect data lakes to be data products
such as recommendation engines. What this means is that Hadoop can be applied to other design patterns, just as an RDBMS can be applied to workloads other than data warehousing.

BRINGING THE RIGHT DESIGN AND TOOLS TOGETHER
The flood of new data from new sources has necessitated a new methodology and set of economics for capturing and analyzing raw data at scale—known as the data lake. The data lake enables organizations to realize new insights from data of unknown and under-appreciated value, perform new kinds of analytics that were not previously possible, retain a longer corporate memory, and optimize existing processes such as data integration. Organizations that are leading the way in delivering value from the data lake realize the importance of starting with a solid design based on required capabilities, and of satisfying those capabilities with enterprise-grade big data technologies.

As close partners, Informatica, Hortonworks, and Teradata have worked to integrate technologies that enable the data lake design pattern. This approach to data lakes leverages low-cost data platforms to provide a collaborative and flexible data curation environment to capture, integrate, refine, and analyze data at scale.

Informatica provides the Big Data Management layer, which enables you to find, prepare, govern, and protect the big data that your business needs to quickly and repeatedly get business value. It accelerates developer and analyst productivity while enabling IT agility in a fully secure and governed environment. Data security and data governance are not merely enforced through manual intervention. Instead, a universal metadata catalog is used to enable more data consumers to quickly and repeatedly get more business value, from more data, without more risk. You get simplified data ingestion and self-service data preparation capabilities alongside data lineage and risk-centric data security.

Hortonworks provides Open Enterprise Hadoop, the open source framework for distributed storage and processing of large sets of data on commodity hardware. Hadoop enables businesses to quickly gain insight from massive amounts of structured and unstructured data. With YARN as the architectural center, Open Enterprise Hadoop enables enterprises to run diverse applications on the common data lake.

Teradata provides a variety of products and services to help enterprises accelerate their data lake journey. Think Big—a Teradata company—provides services for Data Lake Architecture, Data Lake Foundation, and Data Lake Analytics. Performance hurdles, prolonged implementation periods, and reliability issues are solved by the Teradata Appliance for Hadoop when compared to solutions that are not preconfigured. Teradata does the hardware and software integration, plus plenty of testing, so you don't have to. The Teradata Appliance for Hadoop is delivered ready to run and optimized for enterprise-class big data storage and discovery. Teradata also provides self-service tools and accelerators for reliable data ingest and powerful multi-genre analytics for the data lake.

DESIGNING FOR FASTER BUSINESS VALUE
The data lake design pattern focuses your organization's resources on the difficult job of leveraging fit-for-purpose data to derive compelling business insights. But, in order to move the data lake from concept to real-world implementation, you need solutions that work together effectively and span the wide array of data-related considerations. The path to the data lake can be challenging, but it doesn't have to be if you follow a trusted route that experts have already mapped out. With an efficient, high-performance, and highly scalable solution from Informatica, Hortonworks, and Teradata, the data lake design pattern can accelerate your organization's time to business value.

FOR MORE INFORMATION ABOUT THE UNIFIED DATA LAKE ARCHITECTURE, VISIT http://informatica.com/bigdata, http://www.teradata.com/products-and-services/data-lake-products, and www.hortonworks.com.



sponsored content

Loading Your SaaS Data Into a Big Data Lake
Enterprises today are using analytics—descriptive, predictive, and prescriptive—to quickly react to market changes and predict customer needs. As a result, data has become a critical resource for every business. Research shows that by 2018, the world will produce around 50,000 GB of data every second, 90% of which will be unstructured. However, traditional data warehouses are optimized for storing structured data. So every business needs to ask: How can we get the most value out of that 90%?

ADOPTING AN EFFECTIVE DATA MONETIZATION STRATEGY
Big data lakes provide a cost-effective alternative for storing unstructured data. A schema is created after a data scientist selects the data they want to use—commonly known as schema-on-read. Enterprises can reap big benefits from this approach, since it enables IT to retain all the collected data and process only useful content on demand. This enables big data analytical tools to perform broader analyses because there is real-time access to all the raw data being produced. Thanks to this innovation, organizations are now able to derive more profitability from the data—known as data monetization.
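A brief sketch of schema-on-read in PySpark, using an invented support-ticket dataset: the schema is declared by the reader at query time, names only the fields this particular analysis needs, and leaves the raw files untouched for everyone else:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("schema-on-read").getOrCreate()

# The files were landed with no schema at all. Structure is imposed only now,
# by the reader, and only for the fields this analysis cares about; any other
# fields remain in the raw files for future use cases.
schema = StructType([
    StructField("ticket_id", StringType()),
    StructField("customer", StringType()),
    StructField("opened_at", TimestampType()),
])

tickets = spark.read.schema(schema).json("s3a://example-lake/raw/support_tickets/")
tickets.groupBy("customer").count().show()
```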
EASY ACCESS TO YOUR SAAS APP DATA TRANSLATES TO BETTER BI
Data lakes thrive on the 4 Vs of data—volume, velocity, variety, and veracity. So having easy access to data is one of the basic prerequisites for a data lake. While consumer data can be integrated across mobile, social, web, and IoT devices, it is very difficult to pull together enterprise customer information from across various SaaS applications. Most B2B companies find it particularly difficult to gather enough customer data that is consistent, timely, and accessible. However, new SaaS-based solutions have empowered businesses to get access to high-quality customer information.

Currently, there are three fast-growing types of cloud applications:
1. Marketing Automation: Marketo, HubSpot, and Oracle Eloqua help identify potential customers through lead-generation, scoring, nurturing, and conversion activities.
2. Sales Automation: Salesforce, SugarCRM, and Microsoft CRM enable businesses to convert leads into customers by customizing the sales cycle based on lead score.
3. Customer Service Automation: Oracle Service Cloud and ServiceMax help provide excellent customer support. Retaining a customer is much easier than winning a new one and is critical for long-term success.

STANDARDS-BASED CONNECTIVITY IS THE KEY
Critical customer information such as channel activities, customer web journeys, and post-sales support is now hosted in the cloud. These applications can also provide access to historical trends, reports, and predictions. Many organizations use Apache Sqoop to transfer all this data between relational database systems and data lakes. Although Apache Sqoop can connect to any data source supporting the JDBC interface, it is not easy to develop standards-based connectivity for each SaaS application, because each source has a proprietary REST or SOAP API, and sometimes both. To add to the complexity, each of these APIs returns results in a different format—JSON, XML, or CSV.
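To see why, consider what a hand-rolled extractor for just one such API tends to look like. The endpoint, parameters, and response shape below are invented, but the per-source authentication, pagination, and format handling are representative, and none of the code transfers to the next source:

```python
import requests

def fetch_leads(api_key):
    """Bespoke extractor for one hypothetical CRM's proprietary REST API."""
    leads, page = [], 1
    while True:
        resp = requests.get(
            "https://api.example-crm.com/v1/leads",        # proprietary endpoint
            params={"page": page, "per_page": 200},        # its own pagination rules
            headers={"Authorization": "Bearer " + api_key},
        )
        resp.raise_for_status()
        batch = resp.json()["results"]   # this API returns JSON; the next source
        if not batch:                    # may return XML or CSV instead
            return leads
        leads.extend(batch)
        page += 1
```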
DATA CONNECTIVITY MADE SIMPLE
Progress® DataDirect®, the leader in data connectivity solutions, provides a sophisticated solution to this complex ingestion problem. It provides highly scalable and reliable JDBC drivers for all the major SaaS data sources. With no web services APIs to deal with, it's simple to quickly deploy and easily manage drivers. And if the data lake is behind a firewall, Progress® DataDirect Cloud® provides a powerful data connectivity service with universal connectivity to all your data sources—wherever they are, with no complex firewall changes required. The simple configuration in DataDirect connectors frees you from managing API changes, and can improve CPU efficiency up to 400% and save over 550% in memory.
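With a JDBC driver standing in front of the SaaS API, the same ingestion collapses to an ordinary table read. Here is a hypothetical example using Spark's standard JDBC reader; the driver class and connection URL are placeholders for whatever the vendor's driver documents:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("saas-ingest").getOrCreate()

# The driver presents the SaaS application as relational tables, so the REST
# or SOAP details (and JSON/XML/CSV differences) disappear behind SQL.
leads = (spark.read.format("jdbc")
         .option("url", "jdbc:datadirect:sforce://login.salesforce.com")  # placeholder
         .option("dbtable", "LEAD")
         .option("user", "integration_user")
         .option("password", "***")
         .load())

leads.write.mode("append").parquet("s3a://example-lake/raw/salesforce/leads/")
```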
Sumit Sarkar, the Chief Data Evangelist at Progress Software, says: "We're seeing a trend in accessing customer data from SaaS applications to build data lakes. This brings the most valuable data, customer data, into the big data ecosystems where organizations are making significant investments to derive insight."

PROGRESS DATADIRECT
www.datadirect.com

