
Data Profiling: What, Why and How?

DEC 17, 2013 IN DATA PROFILING, DATA QUALITY

Intro to Data Quality


By: Jason Hover

Like it or not, many of the assumptions you have about your data are probably not
accurate. Despite our best efforts, gremlins inevitably find their way into our systems.
The end result, poor data quality, has a host of negative consequences. This brief
article will provide an introduction to data quality concepts, and illustrate how data
profiling can be used to improve data quality.

What Is Data Quality?

Data Quality is a measure of the accuracy, validity and completeness of data.

Is the data of sufficient quality to support the business purpose(s) for which it is
being used?
Are any specific issues within the data decreasing its suitability for these
business purposes?

Do Most Organizations Have A Data Quality Problem?

The short answer is yes; a study by Gartner estimated more than 25 percent of
critical data within Fortune 1000 enterprises to be flawed.

With the myriad of ways that data is captured (online transactions, automated
device capture, manual screen entry, spreadsheet uploads, direct database
changes), there are many opportunities for flawed data to enter source systems.

So What? Does it Matter?

The costs of poor data quality are ongoing and substantial.

A report from The Data Warehouse Institute concluded that data quality
problems cost U.S. businesses more than $600 billion a year and that poor data
quality leads to failures and delays in many high-profile IT projects
Lack of trust in the data due to poor data quality leads to reduced or
discontinued BI usage among information consumers
Poor data quality also has legal/regulatory implications, especially in the wake of
Sarbanes-Oxley, as accurate data is required in order to have accurate financial
reporting

Data Profiling Overview


What Is Data Profiling, and How Can It Help With Data Quality?

Data Profiling is a systematic analysis of the content of a data source (Ralph Kimball).

You must look at the data; you can't trust copybooks, data models, or source
system experts
It is systematic in the sense that it's thorough and looks in all the nooks and
crannies of the data
You have to know your data before you can fix it

What Types Of Analysis Are Performed?

Completeness Analysis
o How often is a given attribute populated, versus blank or null?
Uniqueness Analysis
o How many unique (distinct) values are found for a given attribute
across all records? Are there duplicates? Should there be?
Values Distribution Analysis
o What is the distribution of records across different values for a
given attribute?
Range Analysis
o What are the minimum, maximum, average and median values
found for a given attribute?
Pattern Analysis
o What formats were found for a given attribute, and what is the
distribution of records across these formats?
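The five analyses above can be sketched for a single column in plain Python. This is a minimal illustration, not any particular tool's implementation; the function and key names are my own:

```python
import re
import statistics
from collections import Counter

def profile_column(values):
    """Compute basic profile statistics for one attribute (column)."""
    total = len(values)
    non_null = [v for v in values if v is not None and str(v).strip() != ""]

    # Completeness: how often is the attribute actually populated?
    completeness = len(non_null) / total if total else 0.0

    # Uniqueness: distinct values, and whether duplicates exist.
    distinct = set(non_null)

    # Values distribution: record counts per value.
    distribution = Counter(non_null)

    # Range analysis (numeric values only): min, max, average, median.
    numbers = [v for v in non_null if isinstance(v, (int, float))]
    range_stats = None
    if numbers:
        range_stats = {
            "min": min(numbers),
            "max": max(numbers),
            "mean": statistics.mean(numbers),
            "median": statistics.median(numbers),
        }

    # Pattern analysis: map each value to a format mask (A=letter, 9=digit).
    def mask(v):
        s = re.sub(r"[A-Za-z]", "A", str(v))
        return re.sub(r"[0-9]", "9", s)

    patterns = Counter(mask(v) for v in non_null)

    return {
        "completeness": completeness,
        "distinct_count": len(distinct),
        "has_duplicates": len(distinct) < len(non_null),
        "distribution": distribution,
        "range": range_stats,
        "patterns": patterns,
    }
```

For example, profiling the values `["A1", "A2", None, "A1"]` reports 75% completeness, 2 distinct values with duplicates present, and a single `A9` format pattern.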

What Are Some Real-World Scenarios?

Data profiling can add value in a wide variety of situations. The basic litmus test is: "Is
the quality of data important for this initiative?" If the answer is yes, then data profiling
can help, as it can quickly and thoroughly unveil the true content and structure of your
data.

Some example scenarios include:


Data Warehousing / Business Intelligence (DW/BI) projects
o These projects involve gathering data from disparate systems for
the purpose of reporting and analysis. Data profiling can help
ensure project success by:
Identifying data quality issues that must be corrected in
the source system
Identifying issues that can be corrected in ETL processing
Discovering unanticipated business rules
Even potentially providing a no-go decision on the project
as a whole
Data conversion / Migration projects
o These involve moving data from a legacy system to a new
system. Data profiling can help reduce project risk by:
Identifying data quality issues that must be handled in the
code that moves data from the legacy system to the new
system
Identifying data issues that may require a change to the
target (new) system
Source System Data Quality Initiatives
o These projects endeavor to assess and improve the data quality
of a given source system, seeking to fix existing issues as well as
avoid those issues in the future. Data profiling can help maximize
project ROI by:
Highlighting the areas within the system suffering from the
most serious and/or numerous data quality issues
Identifying issues that may be the result of bad user input
or errant system interfaces

Data Profiling the Old Way


The Manual Approach

Traditionally, data profiling required a skilled technical resource who could manually
query the data source using Structured Query Language (SQL). There is often a
disconnect between the business analyst who knows what the data should be, and the
technical programmer who knows SQL.

Data Profiling the New Way


Benefits of Using Data Profiling Software
There are many benefits to be reaped by using software to automate the data profiling
process, including:

Increased Speed (resulting in hard dollar savings)


o Industry estimates for manual data profiling are approximately 3-
5 hours per attribute; by using a data profiling tool, this can be
reduced to 15-30 minutes per attribute
Sample ROI, assuming 1500 attributes: $281,250 minus
the cost of data profiling software
More Thorough Analysis
o With a manual approach, generally only a subset of the
attributes and the rows are tested; with a data profiling tool, a
thorough evaluation of the data can be performed
o Quote from DM Review: "Smart organizations are abandoning
manual methods in favor of automated data profiling tools that
take much of the guesswork out of finding and identifying
problem data"
Common Repository
o Data profiling tools provide a common repository for storing data
profile results and other key metadata such as notes made
during analysis
Data profile information is centralized
Entire team can share and leverage the information
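
The sample ROI quoted above works out under inputs the article does not state; the following combination is a hypothetical reconstruction that reproduces the figure:

```python
# Hypothetical assumptions (not stated in the article): a $50/hour loaded
# labor rate, 4 hours per attribute manually (within the 3-5 hour estimate),
# and 15 minutes per attribute with a tool (low end of the 15-30 minute range).
ATTRIBUTES = 1500
MANUAL_HOURS = 4.0
TOOL_HOURS = 0.25
HOURLY_RATE = 50.0

hours_saved = ATTRIBUTES * (MANUAL_HOURS - TOOL_HOURS)
gross_savings = hours_saved * HOURLY_RATE
print(gross_savings)  # 281250.0, before subtracting the software cost
```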

Available Tools

A variety of options exist in the marketplace to help ease the challenge of data profiling.
They range in capabilities and price. Tools like Datiris Profiler and Informatica Data
Quality have been successfully deployed by a myriad of organizations. Implemented in
the right way, such tools can reshape the data profiling landscape by reducing effort,
broadening scope, and improving consistency across all data quality initiatives.

Data Profiling: The First Step in Data Quality


When I think of data quality, I think of three primary components: data profiling, data
correction, and data monitoring. Data profiling is the act of analyzing your data contents.
Data correction is the act of correcting your data content when it falls below your
standards. And data monitoring is the ongoing act of establishing data quality standards as
a set of metrics meaningful to the business, reviewing the results on a recurring basis,
and taking corrective action whenever quality falls outside acceptable thresholds.

Today, I want to focus on data profiling. Data profiling is the analysis of data content in
conjunction with every new and existing application effort. We can profile batch data,
near-real-time or real-time data, structured and unstructured data, or any data asset meaningful to
the organization. Data profiling provides organizations the ability to analyze large amounts
of data quickly in a systematic and repeatable process. Data profiling will provide your
organization with a methodical, repeatable, consistent, and metrics-based means to
evaluate your data. You should constantly evaluate your data given its dynamic nature.

I like to break down data profiling into the following categories:

Column Profiling, where all the values are analyzed within each column or
attribute. The objective is to discover the true metadata and uncover data content
quality problems.

Dependency Profiling, where each attribute is compared in relation to every other
attribute within a table, looking for dependency relationships. The focus
is on the discovery of functional dependencies, primary keys, and quality problems
due to data structure.

Redundancy Profiling, where data is compared between tables to determine
which attributes contain overlapping or identical sets of values. The purpose is to
identify duplicate data across systems: foreign keys, synonyms, and homonyms. All
values corrupting data integrity should be identified.

Transformation Profiling, where our processes (business rules) are examined to
determine our data's source(s), what transformation(s) are applied to the data, and
what the data's target(s) are.

Security Profiling, where it is determined who (or what roles) has access to the
data and what they are authorized to do with the data (add, update, delete, etc.).

Custom Profiling, where our data is analyzed in a fashion that is meaningful to our
organization. For example, an organization might want to analyze data
consumption to determine whether data is accessed more by web services, direct
queries, or in some other fashion. One large organization, for example, improved
system throughput after determining how the business and its customers accessed
their information.
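
Redundancy profiling, for instance, largely reduces to comparing value sets between columns of different tables. A minimal sketch under that reading (function and key names are illustrative, not from any tool):

```python
def column_overlap(values_a, values_b):
    """Compare two columns' value sets to flag overlapping or identical
    domains: candidate foreign keys, synonyms, or duplicated data."""
    set_a, set_b = set(values_a), set(values_b)
    shared = set_a & set_b
    return {
        "shared_values": len(shared),
        # Containment: what fraction of each column's values appear in the other?
        "a_in_b": len(shared) / len(set_a) if set_a else 0.0,
        "b_in_a": len(shared) / len(set_b) if set_b else 0.0,
        # If every value in A also exists in B, A may be a foreign key into B.
        "candidate_foreign_key": set_a <= set_b,
    }
```

Running this on, say, an orders table's customer IDs against the customer master's keys would flag full containment as a candidate foreign-key relationship, while partial overlap hints at duplicated or drifting reference data.
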
Most times, you'll find IT and the business have a few false assumptions concerning data
content and its quality. I believe the cost to the business is the risk to its future solvency,
or failure to reach its maximum revenue potential. Sometimes leadership has difficulty
assessing the need for a data quality program due to an inability to assess the cost.
Sometimes action is taken only after a bug is discovered at midnight or a customer feels their
report is wrong. Data profiling allows your organization to be proactive and creates self-
awareness.

The Two Flavors of Data Profiling

There are two methods of data profiling: one based on a sample, and another based on
profiling data in place. Sample-based profiling involves performing your analysis on a
sample of the data. For example, I might want to profile a 100-million-row table. In my
effort to be efficient, my sample might be a third of the rows, selecting every third row.
Sample-based profiling requires me to store my sample in some temporary medium. It
also requires you to ensure you have a representative sample of your
data. From a statistical standpoint, if my sample is too small, I can easily miss data
patterns or fail to properly identify the column's domain.

The second type of profiling involves profiling my data in place. It's treated as just another
query of my database. Generally, you will be profiling PROD and, given the contention for
resources, you'll want to run your queries when they have the least impact on the database.
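
Both sampling strategies are easy to sketch. Note that selecting every third row is systematic rather than truly random sampling: it is simple, but it can be biased if the data has a periodic pattern aligned with the step. The sketch below assumes in-memory rows for illustration:

```python
import random

def systematic_sample(rows, step=3):
    """Every third row, as described above; simple but can miss or
    over-represent values if the data repeats with the same period."""
    return rows[::step]

def random_sample(rows, fraction=0.3, seed=42):
    """A true random sample; the seed is fixed here only so the
    profiling run is repeatable."""
    rng = random.Random(seed)
    k = int(len(rows) * fraction)
    return rng.sample(rows, k)
```

In practice the sample would be written to a temporary table or file, as the article notes, rather than held in a Python list.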

Data Profiling Toolsets

You might be asking what toolsets are available to perform data profiling. You have lots of
options. Most ETL toolsets, like Informatica and DataStage, offer built-in data
profilers. There are stand-alone data profiling alternatives. And if your budget is zero, you
can write your own scripts to perform the analysis.
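
A zero-budget script can go surprisingly far with just the standard library. A minimal sketch against SQLite is shown below; the table and column names are placeholders, and the string-formatted identifiers assume trusted input (never build SQL this way from user-supplied names):

```python
import sqlite3

def profile_sql(conn, table, column):
    """Issue the classic manual profiling queries against one column:
    row counts, completeness, distinct values, and value range."""
    cur = conn.cursor()
    # table/column are trusted identifiers here; SQL parameters cannot
    # substitute identifiers, only values.
    cur.execute(
        f"SELECT COUNT(*), COUNT({column}), COUNT(DISTINCT {column}), "
        f"MIN({column}), MAX({column}) FROM {table}"
    )
    total, populated, distinct, lo, hi = cur.fetchone()
    return {
        "total_rows": total,
        "completeness": populated / total if total else 0.0,
        "distinct_values": distinct,
        "min": lo,
        "max": hi,
    }

# Example against an in-memory database with a placeholder table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(1,), (2,), (None,), (2,)])
result = profile_sql(conn, "t", "x")
```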

Data Profiling Insights

What data should I profile first? I like to focus on mission-critical data first, like customer or
product information. If I have a data warehouse, data mart, or OLAP cube, I'll focus on their
data sources. Your OLTP environment is a good starting point, since most analytic data
stores will pull from these sources.

Once you have performed your data profiling effort, what next? I like to map the results to
my applications' outstanding bug-fix reports. You will often find a high correlation between the
known errors and what your data profile tells you. And you can be proactive in
discovering errors that may reside in your data now. If I know my data contents, I can
create better and smaller test data sets for QA purposes. I like to share my findings with
QA to develop a better test database and improve our test plans.
I can be proactive in my transformations, identifying misalignments where my
data sources contain values that are not being handled properly. And if there are data
anomalies where we have the same set of values stored in multiple locations, we can
address our data structure if needed.

Another useful insight comes in the data modeling structure. Do my tables reflect the
business at hand? Every organization will have tables that are processed each night, and
not used by anyone. When I profile, I like to match my data to my Business Intelligence
environment. When I identify a set of tables and reports that are not used by anyone, we
can remove them from PROD to improve performance. Also, I can match my data
sources to my staging area to determine if my processes are optimal.

There are so many great uses for data profiling. To start, I recommend looking at your
business strategy and assessing your data quality cost. Once you've assessed the cost,
determine if your current data quality strategy aligns to your business needs. A good data
profile strategy should complement your business strategy and provide the business
tangible bottom-line results.

What issues have you overcome in data profiling? How did you work through any issues?
