When data was scarce and expensive, and the tools were technically
complex, it made sense for IT to own and manage data. But those days
are gone, which leads us to the first rule of the new data landscape:
data no longer moves through the enterprise in a predictable manner.
What was once a highly controlled, one-way trip from an application to
the data warehouse or data mart is now a complex journey involving
multiple data sources and destinations, including Hadoop, the Cloud
and special-purpose repositories.

The Rise of the Chief Data Officer

The shift of data ownership from IT to the business is giving rise to the
role of the Chief Data Officer (CDO). But there are also other factors at
play. In many industries, even very traditional ones, data has become the
business, and organizations are therefore changing the way they think
about data.

Data is a valuable asset that can be used to generate revenue or itself be
monetized. As a result, companies need to think about data—not just
in isolation, within single lines of business—but holistically. Hence, the
elevation of data to a C-suite position.

The CDO's primary responsibility is to treat data as an asset that can be
exploited to create value or defended to reduce risk. This means that the
business must govern data to ensure that it is understood and trusted.
This means proactively finding and solving data quality problems so that
the data is accurate. The CDO's objective is to increase the business' trust
and confidence in the data and the quality of analytic insights that come
from it.

RULE 2: THE DATA SUPPLY CHAIN
IS THE NEW ORGANIZING PRINCIPLE

The data supply chain describes how to produce the right information
product to a consumer of that information at the right time through the
right channels. The data supply chain can be a multi-directional flow
or a one-way trip. Either way, it provides an end-to-end view of how
information for a specific purpose is produced and delivered to where
it needs to be.

Data Supply Chain vs. Physical Supply Chains
In manufacturing, organizations know how the end product of a supply
chain will look. That’s not necessarily the case with the data supply chain.
The business may not know the value of the data product when the
supply chain is implemented, so the supply chain must facilitate on-the-fly
discovery and invention, usually through a storage mechanism such as a
data lake where business users can ‘shop for data.’ Furthermore, the data
supply chain must be open and visible and track lineage so that users
understand how a data product was produced.
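The lineage requirement above can be sketched in a few lines of Python. This is a minimal illustration only; the LineageGraph class, its method names, and the dataset names are hypothetical and not part of any particular product's API.

```python
# Minimal sketch of lineage tracking in a data supply chain: record how
# each dataset was derived, then walk upstream so a business user can
# see how a data product was produced. All names are illustrative.
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class LineageGraph:
    # Maps each dataset to the upstream datasets it was derived from.
    parents: Dict[str, List[str]] = field(default_factory=dict)

    def record_step(self, output: str, inputs: List[str]) -> None:
        """Record that `output` was produced from `inputs`."""
        self.parents.setdefault(output, []).extend(inputs)

    def lineage_of(self, dataset: str) -> List[str]:
        """Return every upstream source that contributed to `dataset`."""
        seen: List[str] = []
        stack = list(self.parents.get(dataset, []))
        while stack:
            src = stack.pop()
            if src not in seen:
                seen.append(src)
                stack.extend(self.parents.get(src, []))
        return seen

graph = LineageGraph()
graph.record_step("customer_360", ["crm_extract", "web_clickstream"])
graph.record_step("churn_report", ["customer_360"])
```

Asking for the lineage of "churn_report" then surfaces both the intermediate "customer_360" dataset and its raw sources, which is the open, visible view the supply chain needs to provide.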
[Figure: sources feeding the data supply chain — cloud applications, devices (websites, mobile apps, social media, sensors), third parties, and mainframes.]
Data can be put to use at any point in the data supply chain: the
beginning, middle, or end. If a collection of data is deemed important,
it must be documented and managed like a product, and its quality
validated as sufficient for the data's purpose.
RULE 3: THE BUSINESS CREATES
AND MANAGES THE DATA PRODUCT

A data product is a well-defined dataset that is well suited for a set of use
cases. Its quality has been addressed, and it has clear lineage. The dataset
is documented, communicated, and reliable. Who owns the data is clear,
and so is the process for requesting use of the data, for determining service
level agreements that govern usage, and for communicating changes.

The point of the data supply chain is to produce a trusted product - that is,
a dataset that meets a set of criteria dictated by business policy regarding
the data's use. This requires data governance and data quality.
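One way to make the definition above concrete is to model the data product's metadata directly. This is a hypothetical sketch: the DataProduct class, its fields, and the example values are assumptions for illustration, not a prescribed schema.

```python
# Hypothetical sketch of the metadata a data product carries so that it
# is documented, communicated, and reliable: a clear owner, a service
# level agreement, and lineage back to its sources.
from dataclasses import dataclass
from typing import Tuple

@dataclass
class DataProduct:
    name: str
    description: str        # documented: what the dataset is for
    owner: str              # who owns the data is clear
    sla_hours: int          # service level agreement governing usage
    lineage: Tuple[str, ...]  # clear lineage back to source systems

    def is_publishable(self) -> bool:
        # A trusted product has documentation, an owner, and lineage.
        return bool(self.description and self.owner and self.lineage)

product = DataProduct(
    name="customer_master",
    description="Deduplicated customer records for marketing use cases",
    owner="data-stewardship-team",
    sla_hours=24,
    lineage=("crm_extract", "web_signups"),
)
```

A gate like `is_publishable` is where business policy (Rule 3) meets the supply chain: a dataset missing an owner or lineage simply never becomes a product.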
Quality is a critical piece of the data supply chain that becomes more
important as the sources of data become far more diverse.
Each data supply chain is different in terms of the business' needs and
requirements. The supply chain likely includes a process that involves
collecting data in a data lake. From the data lake, there are multiple
paths out. Some paths are for known data products and others are for
experimental purposes or new data products that are not yet defined.
You need the flexibility to perform data quality and data integration
where it's needed, being careful not to impose quality that will strip
potential value from the data.
[Figure: data flowing out of the lake along three paths — known data products, experimental purposes, and new data products.]
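The three paths out of the lake can be sketched as a simple routing rule. The function, the flags it checks, and the path labels are illustrative assumptions; the point is only that quality is applied selectively, so experimental data keeps its raw potential value.

```python
# Illustrative sketch: route a dataset out of the data lake down one of
# three paths. Known products get full quality rules; experimental data
# is left raw so that value is not stripped away. Flags are assumptions.
def route(dataset: dict) -> str:
    if dataset.get("product_defined"):
        return "known data product"   # full quality and governance apply
    if dataset.get("under_evaluation"):
        return "new data product"     # quality rules being established
    return "experimental"             # leave raw; impose no quality yet

# Example: an undefined dataset stays on the experimental path.
path = route({"source": "clickstream_raw"})
```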
Data quality can be determined based on the following six attributes:

1• Completeness – The degree to which expected data attributes
are provided. Completeness is expressed as a percentage of data that
meets the user's expectations and data availability. For example,
95% of surname record fields that need to be known are complete.

2• Coverage – The degree to which a dataset is complete for all
required values. For example, if a dataset of US zip codes covers
only 20 states, the dataset does not have complete coverage if the
requirement is for all contiguous states.

3• Accuracy – The data reflects the real-world state. For example,
the company name is the real company name, and the company
identifier is verified against the official database of companies
being used (Dun & Bradstreet, SEC, and so forth). Note: Data can
be complete but not accurate.

4• Consistency – Whether the facts across multiple datasets
match and represent the same objects. Consistency also takes
into account whether data is at the same level of aggregation
(e.g., sales transaction data may show individual order line items
for each customer while monthly sales reporting simply shows
total order value by geography).

5• Validity – The extent to which the data conforms to defined
business rules. A value can be valid but not accurate. For example,
the customer's birthdate may be a valid date, but incorrect.

6• Timeliness – The degree to which data is adequately up to date
for a given task. For example, the tax information provided on the
application is for the most recent tax year.

Data Quality Liability

Questions to ask yourself to determine your data quality liability:

• Do you understand the data you have?
• How often do you assess your data quality?
• Can you see how data is trending and share these findings with
key stakeholders?
• Is data accurate?
• Is data complete?
• Is data valid?
• Is data consistent?
• Do you have controls in place so that when data doesn't meet
requirements, it is assigned to a data steward transparently?
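Two of the six attributes, completeness and validity, lend themselves to a short worked example. This is a minimal sketch against a toy list-of-dicts dataset; the field names and the business rule are illustrative assumptions, not a fixed methodology.

```python
# Sketch of measuring completeness and validity on a toy dataset.
# Note row 4: the birthdate is a valid date but cannot be accurate,
# echoing the distinction between validity and accuracy above.
from datetime import date

records = [
    {"surname": "Smith", "birthdate": date(1980, 5, 1)},
    {"surname": "",      "birthdate": date(1999, 2, 14)},  # missing surname
    {"surname": "Lee",   "birthdate": None},               # missing birthdate
    {"surname": "Patel", "birthdate": date(2101, 1, 1)},   # valid type, invalid value
]

def completeness(rows, field):
    """Share of rows where the expected attribute is actually provided."""
    return sum(1 for r in rows if r.get(field)) / len(rows)

def validity(rows, field, rule):
    """Share of provided values that conform to a defined business rule."""
    provided = [r[field] for r in rows if r.get(field)]
    return sum(1 for v in provided if rule(v)) / len(provided)

surname_complete = completeness(records, "surname")   # 3 of 4 rows
birthdate_valid = validity(
    records, "birthdate",
    lambda d: d <= date.today(),  # business rule: no future birthdates
)                                 # 2 of the 3 provided values
```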
Where there's a data supply chain, there's a need for data governance.
Data governance consists of the processes and people that provide
confidence and trust in a data product. Data governance explains what
policies are required, who is responsible for what, sets the process for
resolving problems, and specifies the expectations for quality. Data
that's not properly governed creates institutional risk, may lose its
value and delivers diminishing returns.

The Two Approaches to Data Governance

There are two approaches to data governance:

Governance from above
Before
• Data integration: This was achieved using data integration and ETL tools.
• Data profiling: Data profiling was performed as a one-time standalone
task in an IT development project as a means to define ETL requirements.
• Data quality: Data quality tasks were performed before data got to the
data warehouse.

Now
• Data integration: Various types of data are coming from multiple, varied
sources: traditional applications, Cloud-based applications, the mainframe,
relational data stores, the enterprise data warehouse, as well as streaming
data from Internet of Things (IoT) sources (e.g. sensors), mobile devices,
websites, and so on.
• Data profiling: Data must be profiled continuously to understand what
business rules are needed to govern the data, assess risk, and identify
trends and issues in required data quality.
• Data quality: The data quality problem is ripe as soon as data is created,
changed, or consumed, and quality must be addressed wherever necessary
along the supply chain to ensure the data product meets applicable policies
and requirements.

Implication
• Data integration: IT must be able to transport and transform data into
whatever shape is needed so that the business can act on it.
• Data profiling: The right data quality processes must be put in place at
the right place and time to establish effective levels of trust and
reliability.
• Data quality: Data quality must be embedded within data integration and
application processes anywhere in the data supply chain, including at the
source and when new data sets are assembled and added to the supply chain.

Benefits
• Data integration: Data arrives at its destination in the final format, so
you're not staging and storing data unnecessarily. Fewer workloads in the
cluster.
• Data profiling: Data profiling results can be achieved and assessed by a
broader array of business users. Data profiling information can be used as
input to analytic processes.
• Data quality: Clean data is provided everywhere it is needed. Data is
cleansed using a "just-in-time" approach that ensures information is not
lost.
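Continuous profiling, as described in the "Now" row above, can be sketched as a per-batch column summary. This is a toy stand-in for a profiling tool, not any specific product's API; the statistics chosen (null counts, distinct values, most common value) are illustrative.

```python
# Sketch of continuous profiling: summarize each column of an incoming
# batch so trends and issues can be spotted as the data changes over
# time. A toy illustration, not a product API.
from collections import Counter

def profile_batch(rows):
    """Per-column null counts, distinct counts, and top value for a batch."""
    columns = {key for row in rows for key in row}
    stats = {}
    for col in columns:
        values = [row.get(col) for row in rows]
        present = [v for v in values if v is not None]
        stats[col] = {
            "nulls": sum(1 for v in values if v is None),
            "distinct": len(set(present)),
            "top": Counter(present).most_common(1),
        }
    return stats

batch = [
    {"state": "NY", "zip": "10001"},
    {"state": "NY", "zip": None},
    {"state": "CA", "zip": "94105"},
]
stats = profile_batch(batch)
```

Running this on every batch, rather than once during an IT project, is what lets a data steward watch null rates and value distributions trend over time.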
• The ability to bridge traditional and big data environments without
deploying different products that require specialized skillsets. This
will give you maximum flexibility to build the supply chain across a
hybrid landscape, taking advantage of any volume or variety of data
from any source across your organization.

• Integrated solutions that combine best-of-breed technologies for
data profiling, data quality, and data validation/verification/enrichment
along with the ETL process. This ensures that ingested data meets
quality requirements for a given use case, without the burden of a
heavy software stack that can be expensive and inflexible.

• Ability to adapt to future computing frameworks. This prevents you
from having to change your applications as new technology platforms
are introduced. A "design once, deploy anywhere" approach allows you
to adapt to the emerging data stack and transition applications at your
own pace and supply data wherever it's needed. You can also reuse
shared components and deploy them anywhere to build multiple supply
chains in a consistent manner.

• Flexible deployment options. A technology that can be deployed
on-premises or in the Cloud, where it is managed for you, gives you the
flexibility to deploy the solution based on your business needs.

• Simplicity and ease of use. An intuitive graphical user interface and
no coding enable collaboration between IT and business users, and rapid
adoption across the organization.
Syncsort is the global leader in Big Iron to Big Data software. We organize data everywhere to keep the world
working – the same data that powers machine learning, AI and predictive analytics. We use our decades of
experience so that more than 7,000 customers, including 84 of the Fortune 100, can quickly extract value
from their critical data anytime, anywhere. Our products provide a simple way to optimize, assure, integrate
and advance data, helping to solve for the present and prepare for the future. Learn more at syncsort.com.
www.syncsort.com
© 2018 Syncsort Incorporated. All rights reserved. All other company and product names used herein may be the trademarks of their respective companies.