
The New Rules for Your Data Landscape

How business users and IT work together to build trusted data products in the data supply chain

Syncsort | The New Rules for Your Data Landscape


Introduction

Everything about data is changing—its rate of growth, how it flows, and how it takes shape. Consequently, the division of labor around data is also changing. IT and the business represent a key partnership, but new tools and ways of thinking are empowering the business to do more.

• How do you transition to this brave new world?

• How do you liberate the business to take greater ownership of data while continuing to govern that data?

Join us on a journey as we describe the new rules that are transforming the relationship between business and IT and unleashing the power of data.




It’s Time to Give the Business What It Wants

When data was scarce and expensive, and the tools were technically
complex, it made sense for IT to own and manage data. But those days
are gone, which leads us to the first rule of the new data landscape:

RULE 1: THE BUSINESS OWNS THE DATA


It’s not a power grab. It’s common sense. The business already
understands and uses the data. The business is also accountable
for ensuring that its usage complies with regulatory requirements.
Furthermore, data no longer primarily resides in or is generated
by the data center—IT’s domain. Data is increasingly generated
outside the organization. The business is better suited to understand
and use this data, such as marketing and clickstream data, partner
data, social data, and more.

These trends have led to the business gradually taking ownership of the data. We saw first glimpses of this with business applications and processes, and later with business intelligence and self-service reporting. Today, we're seeing data preparation, data quality, and the ability to define and manage data products created by a data supply chain fall under the influence and control of the business.

The business has gradually taken over data. It's time to give it what it wants, and now technology is available to enable organizations to do that.



Introduction to the Data Supply Chain

Data no longer moves through the enterprise in a predictable manner. What was once a highly controlled, one-way trip from an application to the data warehouse or data mart is now a complex journey involving multiple data sources and destinations, including Hadoop, the Cloud, and special-purpose repositories.

RULE 2: THE DATA SUPPLY CHAIN
IS THE NEW ORGANIZING PRINCIPLE

The data supply chain describes how to deliver the right information product to a consumer of that information at the right time through the right channels. The data supply chain can be a multi-directional flow or a one-way trip. Either way, it provides an end-to-end view of how information for a specific purpose is produced and delivered to where it needs to be.

Data Supply Chain vs. Physical Supply Chains

In manufacturing, organizations know how the end product of a supply chain will look. That's not necessarily the case with the data supply chain. The business may not know the value of the data product when the supply chain is implemented, so the supply chain must facilitate on-the-fly discovery and invention, usually through a storage mechanism such as a data lake where business users can 'shop for data.' Furthermore, the data supply chain must be open and visible and track lineage so that users understand how a data product was produced.

The Rise of the Chief Data Officer

The shift of data ownership from IT to the business is giving rise to the role of Chief Data Officer (CDO). But there are also other factors at play. In many industries, even very traditional ones, data has become the business, and organizations are therefore changing the way they think about data.

Data is a valuable asset that can be used to generate revenue or itself be monetized. As a result, companies need to think about data—not just in isolation, within single lines of business—but holistically. Hence, the elevation of data to a C-suite position.

The CDO's primary responsibility is to treat data as an asset that can be exploited to create value or defended to reduce risk. This means that the business must govern data to ensure that it is understood and trusted, proactively finding and solving data quality problems so that the data is accurate. The CDO's objective is to increase the business' trust and confidence in the data and in the quality of the analytic insights that come from it.



Provisioning Your Data Across a Hybrid Landscape

Data is everywhere: in on-premises data centers, in cloud applications, on devices, with third parties, on mainframes, and the list goes on. You need to be able to access and harmonize data from any point in the hybrid landscape. The problem is that data is in formats that aren't necessarily compatible or readily consumed in the data supply chain.

Newer Big Data Sources

Much of the big data that organizations are excited to tap for analytics comes from relatively new sources, such as websites, mobile applications, social media, and sensors. Not only do these data sources originate from different systems, but they are typically in unstructured or semi-structured formats and generated as data streams.

Syncsort | The New Rules for Your Data Landscape 05



Traditional Data Sources

For decades, organizations have housed important transactional and historical data in traditional enterprise systems. In contrast to newer data, this data is structured and processed in batches. However, traditional doesn't mean homogeneous or simple. Many industries that have dealt with large volumes of data over the course of decades—such as banking, insurance, retail, and healthcare—rely on mainframes for their critical applications. These systems hold valuable data assets that are beneficial for discovery—if you can unlock them. Unfortunately, mainframe data can be huge, complex, and not readily compatible with other data types. It's therefore typically difficult and expensive to operate on and integrate mainframe data.

Once the data is accessed from all the varied sources, it needs to be sent to the right environment, with the right tools, to be transformed into a high-value data product. Data lakes, built on Hadoop and similar frameworks, let diverse data be stored without prior integration, and they are popular for this step in the supply chain due to their ability to scale cost-effectively. As with physical supply chains, efficiency is important, so minimizing development time and maintenance without sacrificing performance is essential. This requires commercial, enterprise-grade technology with processes that work inside big data stores, a lightweight footprint, and a flexible design that adapts to changes in the landscape.



The Data Product: The Consumable of the Data Supply Chain

Data can be put to use at any point in the data supply chain: the beginning, middle, or end. If a collection of data is deemed important, it must be documented and managed like a product, and its quality validated as sufficient for the data's purpose.

RULE 3: THE BUSINESS CREATES
AND MANAGES THE DATA PRODUCT

A data product is a well-defined dataset that is well suited for a set of use cases. Its quality has been addressed, and it has clear lineage. The dataset is documented, communicated, and reliable. Who owns the data is clear, and so is the process for requesting use of the data, for determining service level agreements that govern usage, and for communicating changes.

The point of the data supply chain is to produce a trusted product, that is, a dataset that meets a set of criteria dictated by business policy regarding the data's use. This requires data governance and data quality.
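The attributes that make a dataset a data product (documented ownership, clear lineage, service-level agreements, quality criteria) can be captured in a small metadata record. The sketch below is one hypothetical way to represent that; none of the field names come from Syncsort or any standard.

```python
from dataclasses import dataclass, field

@dataclass
class DataProduct:
    """Metadata that makes a dataset a managed, trusted data product."""
    name: str
    owner: str                                    # who is accountable for the data
    description: str                              # documented purpose and contents
    lineage: list = field(default_factory=list)   # upstream sources, in order
    sla: dict = field(default_factory=dict)       # usage terms, e.g. refresh cadence
    quality_checks: list = field(default_factory=list)

# A hypothetical data product produced by a marketing-facing supply chain
churn = DataProduct(
    name="customer_churn_scores",
    owner="marketing-analytics",
    description="Weekly churn propensity per active customer.",
    lineage=["crm.accounts", "web.clickstream", "billing.invoices"],
    sla={"refresh": "weekly", "max_null_rate": 0.02},
    quality_checks=["ids match CRM", "scores in [0, 1]"],
)
print(churn.owner)       # ownership is explicit: marketing-analytics
print(churn.lineage[0])  # first upstream source: crm.accounts
```

Even a record this small answers the questions the rule raises: who owns the data, where it came from, and what terms govern its use.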



Data Quality

Quality is a critical piece of the data supply chain that becomes more important as the sources of data become far more diverse.

To create a trusted data product, quality must be part of the data supply chain. It's important to understand the data's quality on the way out and on the way in, and to address data quality at the right place and at the right time. This may mean addressing quality at multiple points. For example, data may be cleaned at the source but require additional validation checks when it's integrated to ensure that the combined data is consistent and accurate.

Data quality becomes an issue when you start adding more data or changing data, and when you want a unified view. For example, if data is not cleaned and keys don't match, it might not integrate properly.

Different parts of the business have differing missions that require different data quality techniques. Marketing's mission is to effectively target customers and prospects across multiple channels; its techniques include matching and merging records, eliminating duplicates, validating address and location data, and enriching and appending customer data. Finance's mission is to comply with financial reporting requirements; its techniques include assessment of accuracy, consistency checks at each step, and audit trails. Data science's mission is to validate hypotheses about the business, including the potential for data to provide new business insights; its techniques include data discovery, model building, and hypothesis testing.
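The matching and deduplication techniques in marketing's toolkit can be illustrated with a short sketch. This is a minimal, hypothetical example: the field names, normalization rules, and the crude "prefer non-empty values" survivorship logic are assumptions for demonstration, not how any particular product works.

```python
import re

def normalize(record):
    """Build a simple match key from assumed name/address fields:
    lowercase, strip punctuation, collapse whitespace."""
    parts = [record.get("name", ""), record.get("address", "")]
    key = " ".join(parts).lower()
    key = re.sub(r"[^a-z0-9 ]", "", key)
    return re.sub(r"\s+", " ", key).strip()

def merge_duplicates(records):
    """Group records sharing a match key and merge them,
    preferring non-empty field values (a crude survivorship rule)."""
    merged = {}
    for rec in records:
        key = normalize(rec)
        if key not in merged:
            merged[key] = dict(rec)
        else:
            for fld, value in rec.items():
                if value and not merged[key].get(fld):
                    merged[key][fld] = value
    return list(merged.values())

customers = [
    {"name": "Acme Corp.", "address": "1 Main St",  "phone": ""},
    {"name": "ACME CORP",  "address": "1 Main St.", "phone": "555-0100"},
    {"name": "Widget Inc", "address": "9 Elm Ave",  "phone": "555-0199"},
]
clean = merge_duplicates(customers)
print(len(clean))         # 2: the two Acme variants collapse into one record
print(clean[0]["phone"])  # 555-0100: the phone number survives the merge
```

Real matching engines use fuzzy comparison and weighted scoring rather than exact keys, but the shape of the problem is the same: derive a comparable key, group, and pick surviving values.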



Each data supply chain is different in terms of the business' needs and requirements. The supply chain likely includes a process that involves collecting data in a data lake. From the data lake, there are multiple paths out. Some paths are for known data products, and others are for experimental purposes or new data products that are not yet defined. You need the flexibility to perform data quality and data integration where it's needed, being careful not to impose quality that will strip potential value from the data.



Six Data Quality Dimensions

Data quality can be determined based on the following six attributes:

1. Completeness – The degree to which expected data attributes are provided. Completeness is expressed as a percentage of data that meets the user's expectations and data availability. For example, 95% of surname record fields that need to be known are complete.

2. Coverage – The degree to which a dataset is complete for all required values. For example, if a dataset of US zip codes covers only 20 states, the dataset does not have complete coverage if the requirement is for all contiguous states.

3. Accuracy – The data reflects the real-world state. For example, the company name is the real company name, and the company identifier is verified against the official database of companies being used (Dun & Bradstreet, SEC, and so forth). Note: Data can be complete but not accurate.

4. Consistency – Whether the facts across multiple datasets match and represent the same objects. Consistency also takes into account whether data is at the same level of aggregation (e.g., sales transaction data may show individual order line items for each customer while monthly sales reporting simply shows total order value by geography).

5. Validity – The extent to which the data conforms to defined business rules. A value can be valid but not accurate. For example, the customer's birthdate may be a valid date, but incorrect.

6. Timeliness – The degree to which data is adequately up to date for a given task. For example, the tax information provided on the application is for the most recent tax year.

Data Quality Liability

Questions to ask yourself to determine your data quality liability:

• Do you understand the data you have?
• How often do you assess your data quality?
• Can you see how data is trending and share these findings with key stakeholders?
• Is data accurate?
• Is data complete?
• Is data valid?
• Is data consistent?
• Do you have controls in place so that when data doesn't meet requirements, it is assigned to a data steward transparently?
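Several of these dimensions reduce to measurable checks you can automate. The sketch below scores completeness and validity over a handful of records; the field names, the birthdate business rule, and the sample data are illustrative assumptions, not a prescribed standard.

```python
from datetime import date

records = [
    {"surname": "Ueda",   "birthdate": "1985-02-30"},  # not a real calendar date
    {"surname": "",       "birthdate": "1990-07-14"},  # missing surname
    {"surname": "Silva",  "birthdate": "1990-07-14"},
    {"surname": "Okafor", "birthdate": "2031-01-01"},  # well-formed but in the future
]

def completeness(records, field):
    """Completeness: share of records where the expected field is populated."""
    filled = sum(1 for r in records if r.get(field))
    return filled / len(records)

def is_valid_birthdate(value):
    """Validity rule (assumed): parses as a real calendar date, not in the future."""
    try:
        y, m, d = (int(x) for x in value.split("-"))
        return date(y, m, d) <= date.today()
    except ValueError:
        return False

validity = sum(is_valid_birthdate(r["birthdate"]) for r in records) / len(records)
print(f"surname completeness: {completeness(records, 'surname'):.0%}")  # 75%
print(f"birthdate validity:   {validity:.0%}")                          # 50%
```

Tracking such percentages per batch over time is one way to "see how data is trending" and share the findings with stakeholders, as the liability questions suggest.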



Data Governance

Where there's a data supply chain, there's a need for data governance. Data governance consists of the processes and people that provide confidence and trust in a data product. Data governance explains what policies are required, defines who is responsible for what, sets the process for resolving problems, and specifies the expectations for quality. Data that's not properly governed creates institutional risk, may lose its value, and delivers diminishing returns.

There are two approaches to data governance:

• 1 - Governance from above. Governance from above is mandated top-down across the enterprise, typically to satisfy regulatory, compliance, or risk requirements. It applies broadly to the organization's data rather than being tied to any single business use.

• 2 - Governance for a purpose. Governance for a purpose is generally established within the enterprise for a specific business purpose, such as to generate value or revenue, or to reduce costs. This type of governance is subjective in nature and is developed once important data has been identified. Governance is used to document the data and help support collaboration (who is doing what). It also documents required conditions, such as service-level agreements (SLAs), for the data.



Implications of the New Rules
The data supply chain requires reconsideration of how data is moved, manipulated, and cleansed.

Retrieving and moving data (to another environment for transformation)

Before: This was achieved using data integration and ETL tools.

Now: Various types of data are coming from multiple, varied sources: traditional applications, Cloud-based applications, the mainframe, relational data stores, and the enterprise data warehouse, as well as streaming data from Internet of Things (IoT) sources (e.g. sensors), mobile devices, websites, and so on.

Implication: IT must be able to transport and transform data into whatever shape is needed so that the business can act on it.

Benefits:
• Data arrives at its destination in the final format, so you're not staging and storing data unnecessarily.
• Fewer workloads in the cluster.

Profiling data

Before: Data profiling was performed as a one-time standalone task in an IT development project as a means to define ETL requirements.

Now: Data must be profiled continuously to understand what business rules are needed to govern the data, assess risk, and identify trends and issues in required data quality.

Implication: The right data quality processes must be put in place at the right place and time to establish effective levels of trust and reliability.

Benefits:
• Data profiling results can be achieved and assessed by a broader array of business users.
• Data profiling information can be used as input to analytic processes.

Cleaning data

Before: Data quality tasks were performed before data got to the data warehouse.

Now: The data quality problem is ripe as soon as data is created, changed, or consumed, and quality must be addressed wherever necessary along the supply chain to ensure the data product meets applicable policies and requirements.

Implication: Data quality must be embedded within data integration and application processes anywhere in the data supply chain, including at the source and when new data sets are assembled and added to the supply chain.

Benefits:
• Clean data is provided everywhere it is needed.
• Data is cleansed using a "just-in-time" approach that ensures information is not lost.
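Continuous profiling can start as simply as computing per-field statistics on every batch and flagging drift for a data steward to review. The sketch below is one hypothetical way to do that; the fields, the 10% tolerance, and the drift rule are assumptions for illustration, not a description of any product.

```python
def profile(batch):
    """Compute per-field null rates and distinct-value counts for a batch."""
    stats = {}
    fields = {f for rec in batch for f in rec}
    for fld in fields:
        values = [rec.get(fld) for rec in batch]
        nulls = sum(1 for v in values if v in (None, ""))
        stats[fld] = {
            "null_rate": nulls / len(batch),
            "distinct": len({v for v in values if v not in (None, "")}),
        }
    return stats

def drifted(baseline, current, tolerance=0.10):
    """Flag fields whose null rate moved beyond the tolerance since the
    baseline, so a data steward can investigate before the data flows on."""
    return [f for f in baseline
            if abs(current.get(f, {}).get("null_rate", 1.0)
                   - baseline[f]["null_rate"]) > tolerance]

monday  = [{"id": 1, "region": "EU"}, {"id": 2, "region": "US"}]
tuesday = [{"id": 3, "region": ""},   {"id": 4, "region": ""}]

flags = drifted(profile(monday), profile(tuesday))
print(flags)  # ['region']: the null rate jumped from 0% to 100%
```

Running a check like this on each new batch, rather than once per project, is the difference between the "Before" and "Now" columns above.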



Requirements for a Data Supply Chain That
Creates Governed & High Quality Data Products
What does it take to build a data supply chain that produces a trusted data product and is as open and flexible as possible,
enabling integration of additional content as needed?

• The ability to bridge traditional and big data environments without deploying different products that require specialized skillsets. This will give you maximum flexibility to build the supply chain across a hybrid landscape, taking advantage of any volume or variety of data from any source across your organization.

• Integrated solutions that combine best-of-breed technologies for data profiling, data quality, and data validation/verification/enrichment along with the ETL process. This ensures that ingested data meets quality requirements for a given use case, without the burden of a heavy software stack that can be expensive and inflexible.

• Simplicity and ease of use. An intuitive graphical user interface and no coding enable collaboration between IT and business users, and rapid adoption across the organization.

• The ability to adapt to future computing frameworks. This prevents you from having to change your applications as new technology platforms are introduced. A "design once, deploy anywhere" approach allows you to adapt to the emerging data stack, transition applications at your own pace, and supply data wherever it's needed. You can also reuse shared components and deploy them anywhere to build multiple supply chains in a consistent manner.

• Flexible deployment options. A technology that can be deployed on-premises or in the Cloud, where it is managed for you, gives you the flexibility to deploy the solution based on your business needs.



Conclusion

The business has a much greater stake—and greater role—in data. As IT relinquishes control, there is a shared responsibility for the two parties to
come together to give the business what it wants while also ensuring that
data products are usable, valuable, and properly governed. With the right
tools, organizations are developing data supply chains that incorporate
data integration, data quality, and data governance to produce trusted
data products. Because the business is driving the strategy and direction
of this supply chain, these data products are just as important as the outputs of a physical supply chain in creating sustainable value for the organization, and arguably more so.

Syncsort is a global leader in data liberation, integrity, and integration for next-generation analytics. Visit www.syncsort.com to learn how
our best-of-breed portfolio of data quality and data integration software
delivers trusted business insights at thousands of organizations around
the world.



About Syncsort

Syncsort is the global leader in Big Iron to Big Data software. We organize data everywhere to keep the world
working – the same data that powers machine learning, AI and predictive analytics. We use our decades of
experience so that more than 7,000 customers, including 84 of the Fortune 100, can quickly extract value
from their critical data anytime, anywhere. Our products provide a simple way to optimize, assure, integrate
and advance data, helping to solve for the present and prepare for the future. Learn more at syncsort.com.

www.syncsort.com
© 2018 Syncsort Incorporated. All rights reserved. All other company and product names used herein may be the trademarks of their respective companies.
