
Data Quality Project Estimation and Scheduling Factors
Challenge
This Best Practice is intended to assist project managers who must estimate the time and
resources necessary to address data quality issues within data integration or other data-
dependent projects.
Its primary concerns are the project estimation issues that arise when you add a discrete data
quality stage to your data project. However, it also examines the factors that determine when,
or whether, you need to build a larger data quality element into your project.
Description
At a high level, there are three ways to add data quality to your project:
- Add a discrete and self-contained data quality stage, such as that enabled by using pre-built Informatica Data Quality (IDQ) processes, or plans, in conjunction with Informatica Data Cleanse and Match.
- Add an expanded but finite set of data quality actions to the project, for example in cases where pre-built plans do not fit the project parameters.
- Incorporate data quality actions throughout the project.
This document should help you decide which of these methods best suits your project and
assist in estimating the time and resources needed for the first and second methods.
Using Pre-Built Plans with Informatica Data Cleanse and
Match
Informatica Data Cleanse and Match is a cross-application solution that enables PowerCenter
users to add data quality processes defined in IDQ to custom transformations in PowerCenter.
It incorporates the following components:
- Data Quality Workbench, a user-interface application for building and executing data quality processes, or plans.
- Data Quality Integration, a plug-in component for PowerCenter that integrates PowerCenter and IDQ.
- At least one set of reference data files that can be read by data quality plans to validate and enrich certain types of project data. For example, Data Cleanse and Match can be used with the North America Content Pack, which includes pre-built data quality plans and complete address reference datasets for the United States and Canada.
Data Quality Engagement Scenarios
Data Cleanse and Match delivers its data quality capabilities out of the box; a PowerCenter
user can select data quality plans and add them to a Data Quality transformation without
leaving PowerCenter. In this way, Data Cleanse and Match capabilities can be added into a
project plan as a relatively short and discrete stage.
In a more complex scenario, a Data Quality Developer may wish to modify the underlying
data quality plans or create new plans to focus on quality analysis or enhancements in
particular areas. This expansion of the data quality operations beyond the pre-built plans can
also be handled within a discrete data quality stage.
The Project Manager may decide to implement a more thorough approach to data quality and integrate data quality actions throughout the project plan. In many instances, a convincing case can be made for enlarging the data quality aspect to encompass the full data project. (Velocity contains several tasks and subtasks concerned with such an endeavor.) This is well worth considering: businesses often do not realize the extent to which their business and project goals depend on the quality of their data.
The project impact of these three types of data quality activity can be summarized as follows:
DQ Approach | Estimated Project Impact
Simple stage | 10 days, 1-2 Data Quality Developers
Expanded data quality stage | 15-20 days, 2 Data Quality Developers, high visibility to business
Data quality integrated with data project | Duration of data project, 2 or more project roles, impact on business and project objectives
Note: The actual time that should be allotted to the data quality stages noted above
depends on the factors discussed in the remainder of this document.
Factors Influencing Project Estimation
The factors influencing project estimation for a data quality stage range from high-level
project parameters to lower-level data characteristics. The main factors are listed below and
explained in detail later in this document.
- Base and target levels of data quality
- Overall project duration/budget
- Overlap of sources/complexity of data joins
- Quantity of data sources
- Matching requirements
- Data volumes
- Complexity and quantity of data rules
- Geography
Determine which scenario best fits your data project (out-of-the-box Data Cleanse and Match, expanded Data Cleanse and Match, or thorough data quality integration) by considering the project's overall objectives and its mix of factors.
The Simple Data Quality Stage
Project managers can consider the use of pre-built plans with Data Cleanse and Match as a
simple scenario with a predictable number of function points that can be added to the project
plan as a single package.
You can add the North America Content Pack plans to your project if the project meets most
of the following criteria. Similar metrics apply to other types of pre-built plans:
- Baseline functionality of the pre-built data quality plans meets 80 percent of the project needs.
- Complexity of data rules is relatively low.
- Business rules present in pre-built plans need minimal fine-tuning.
- Target data quality level is achievable (i.e., <100 percent).
- Quantity of data sources is relatively low.
- Overlap of data sources/complexity of database table joins is relatively low.
- Matching requirements and targets are straightforward.
- Overall project duration is relatively short.
- The project relates to a single country.
Note that the source data quality level is not a major concern.
Implementing the Simple Data Quality Stage
The out-of-the-box scenario is designed to deliver significant increases in data quality in
those areas for which the plans were designed (i.e., North American name and address data)
in a short time frame. As indicated above, it does not anticipate major changes to the
underlying data quality plans. It involves the following three steps:
1. Run pre-built plans.
2. Review plan results.
3. Transfer data to the next stage in the project and (optionally) add data quality plans to
PowerCenter transformations.
While every project is different, a single iteration of the simple model may take
approximately five days, as indicated below:
- Run pre-built plans (2 days)
- Review plan results (1 day)
- Pass data to the next stage in the project and add plans to PowerCenter transformations (2 days)
Note that these estimates fit neatly into a five-day week but may prove optimistic in some cases. Note also that a Data Quality Developer can tune plans on an ad hoc basis to suit the project. Therefore, you should plan for a two-week simple data quality stage.
Step (Simple Stage) | Days, Week 1 | Days, Week 2
Run pre-built plans | 2 |
Review plan results; fine-tune pre-built plans if necessary | 1 |
Re-run pre-built plans | 2 |
Review plan results with stakeholders; add plans to PowerCenter transformations and define mappings | | 2
Run PowerCenter workflows | | 1
Review results/obtain approval from stakeholders | | 1
Approve and pass all files to the next project stage | | 1
Expanding the Simple Data Quality Stage
Although the simple scenario above treats the data quality components as a black box, it does allow for modifications to the data quality plans. The types of plan tuning that developers can undertake in this time frame include changing the reference dictionaries used by the plans, editing those dictionaries, and re-selecting the data fields used by the plans as keys to identify data matches. The time frame does not, however, allow for building or rebuilding a plan from scratch.
The gap between base and target levels of data quality is an important area to consider when
expanding the data quality stage. The Developer and Project Manager may decide to add a
data analysis step in this stage, or even decide to split these activities across the project plan
by conducting a data quality audit early in the project, so that issues can be revealed to the
business in advance of the formal data quality stage. The schedule should allow for sufficient
time for testing the data quality plans and for contact with the business managers in order to
define data quality expectations and targets.
In addition:
- If a data quality audit is added early in the project, the data quality stage grows into a project-length endeavor.
- If the data quality audit is included in the discrete data quality stage, the expanded, three-week data quality stage may look like this:
Step (Enhanced DQ Stage) | Days, Week 1 | Days, Week 2 | Days, Week 3
Set up and run data analysis plans; review plan results | 1-2 | |
Conduct advance tuning of pre-built plans; run pre-built plans | 2 | |
Review plan results with stakeholders | 1 | |
Modify pre-built plans or build new plans from scratch | | 2 |
Re-run the plans | | 2 |
Review plan results/obtain approval from stakeholders | | 1 |
Add approved plans to PowerCenter transformations, define mappings | | | 2
Run PowerCenter workflows | | | 1
Review results/obtain approval from stakeholders | | | 1
Approve and pass all files to the next project stage | | | 1

Sizing Your Data Quality Initiatives
The following section describes the factors that affect the estimated time that the data quality
endeavors may add to a project. Estimating the specific impact that a single factor is likely to
have on a project plan is difficult, as a single data factor rarely exists in isolation from others.
If one or two of these factors apply to your data, you may be able to treat them within the
scope of a discrete DQ stage. If several factors apply, you are moving into a complex
scenario and must design your project plan accordingly.
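As a rough triage aid, the rule of thumb above (one or two applicable factors fit within a discrete stage; several factors indicate a complex scenario) can be sketched as a small helper. This is an illustrative sketch only; the function name, factor strings, and the exact cutoff are assumptions, not part of any Informatica product:

```python
# Hypothetical triage helper based on the guidance above: one or two
# applicable sizing factors fit within a discrete DQ stage; several
# factors indicate a complex, project-length scenario.

SIZING_FACTORS = [
    "wide gap between base and target data quality",
    "long overall project duration/budget",
    "high overlap of sources/complex data joins",
    "large quantity of data sources",
    "demanding matching requirements",
    "large data volumes",
    "complex or numerous data rules",
    "multiple geographies",
]

def suggest_dq_approach(applicable_factors):
    """Map the number of applicable sizing factors to a DQ scenario."""
    if len(applicable_factors) <= 2:
        return "discrete data quality stage"
    return "data quality integrated throughout the project"

print(suggest_dq_approach(SIZING_FACTORS[:1]))  # discrete data quality stage
print(suggest_dq_approach(SIZING_FACTORS[:4]))  # data quality integrated throughout the project
```

In practice the decision is a judgment call, not a count; the sketch simply makes the document's threshold explicit.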
Base and Target Levels of Data Quality
The rigor of your data quality stage depends in large part on the current (i.e., base) levels of
data quality in your dataset and the target levels that you want to achieve. As part of your
data project, you should run a set of data analysis plans and determine the strengths and
weaknesses of the proposed project data. If your data is already of a high quality relative to
project and business goals, then your data quality stage is likely to be a short one!
If possible, you should conduct this analysis at an early stage in the data project (i.e., well in
advance of the data quality stage). Depending on your overall project parameters, you may
have already scoped a Data Quality Audit into your project. However, if your overall project
is short in duration, you may have to tailor your data quality analysis actions to the time
available.
Action: If there is a wide gap between base and target data quality levels, determine whether a short data quality stage can bridge the gap. If a data quality audit is conducted early in the project, you have latitude to discuss this with the business managers in the context of the overall project timeline. In general, it is good practice to agree with the business to incorporate time into the project plan for a dedicated Data Quality Audit. (See Task 2.8 in the Velocity Work Breakdown Structure.)
If the aggregated data quality percentage for your project's source data is greater than 60 percent, and your target percentage level for the data quality stage is less than 95 percent, then you are in the zone of effectiveness for Data Cleanse and Match.
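This zone-of-effectiveness rule can be expressed as a simple check. The function itself is a hypothetical helper; only the 60 and 95 percent thresholds come from the guidance above:

```python
def in_cleanse_and_match_zone(base_pct, target_pct):
    """Return True when a project falls inside the zone of effectiveness
    for Data Cleanse and Match: aggregated source data quality above
    60 percent and a target level for the DQ stage below 95 percent.
    (Hypothetical helper; only the thresholds come from the text.)"""
    return base_pct > 60 and target_pct < 95

print(in_cleanse_and_match_zone(72, 90))  # True
print(in_cleanse_and_match_zone(40, 90))  # False: base quality too low
print(in_cleanse_and_match_zone(72, 99))  # False: target too demanding
```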
Note: You can assess data quality according to at least six criteria. Your business may need to improve data quality levels with respect to one criterion but not another. See the Best Practice document Data Cleansing.
Overall Project Duration/Budget
A data project with a short duration may not have the means to accommodate a complex data
quality stage, regardless of the potential or need to enhance the quality of the data involved.
In such a case, you may have to incorporate a finite data quality stage.
Conversely, a data project with a long time line may have scope for a larger data quality
initiative. In large data projects with major business and IT targets, good data quality may be
a significant issue. For example, poor data quality can affect the ability to cleanly and quickly
load data into target systems. Major data projects typically have a genuine need for high-
quality data if they are to avoid unforeseen problems.
Action: Evaluate the project schedule parameters and expectations put forward by the
business and evaluate how data quality fits into these parameters.
You must also determine if there are any data quality issues that may jeopardize project success, such as a poor understanding of the data structure. These issues may already be visible to the business community. If not, they should be raised with management. Bear in mind that data quality is not simply concerned with the accuracy of data values; it can also encompass the project metadata.
Overlap of Sources/Complexity of Data Joins
When data sources overlap, data quality issues can be spread across several sources. The
relationships among the variables within the sources can be complex, difficult to join
together, and difficult to resolve, all adding to project time.
If the joins between the data are simple, then this task may be straightforward. However, if
the data joins use complex keys or exist over many hierarchies, then the data modeling stage
can be time-consuming, and the process of resolving the indices may be prolonged.
Action: You can tackle complexity in data sources and in required database joins within a
data quality stage, but in doing so, you step outside the scope of the simple data quality stage.
Quantity of Data Sources
This issue is similar to that of data source overlap and complexity (above). The greater the
quantity of sources, the greater the opportunity for data quality issues to arise. The number of
data sources has a particular impact on the time required to set up the data quality solution.
(The source data setup in PowerCenter can facilitate the data setup in the data quality stage.)
Action: You may find that the number of data sources correlates with the number of data
sites covered by the project. If your project includes data from multiple geographies, you step
outside the scope of a simple data quality stage.
Matching Requirements
Data matching plans are the most performance-intensive type of data quality plan. Moreover,
matching plans are often coupled to a type of data standardization plan (i.e., grouping plan)
that prepares the data for match analysis.
Matching plans are not necessarily more complex to design than other types of plans, although they may contain sophisticated business rules. However, the time taken to execute a matching plan grows much faster than linearly with the volume of data records passed through the plan. (Specifically, the time taken is driven by the size and number of the data groups created in the grouping plans.)
Action: Consult the Best Practice on Effective Data Matching Techniques and determine
how long your matching plans may take to run.
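To see why group sizes dominate matching runtime, consider a simplified cost model in which every record in a group must be compared with every other record in that group, giving n*(n-1)/2 comparisons per group of n records. This is an illustrative sketch of the scaling behavior, not the actual IDQ matching engine:

```python
def match_comparisons(group_sizes):
    """Total pairwise comparisons a matching plan must perform when records
    are partitioned into groups of the given sizes. Each group of n records
    needs n*(n-1)/2 comparisons, so runtime is driven by group size, not
    just total record volume. (Simplified model, not the IDQ engine.)"""
    return sum(n * (n - 1) // 2 for n in group_sizes)

# 100,000 records in 1,000 groups of 100:
print(match_comparisons([100] * 1000))    # 4,950,000 comparisons
# The same 100,000 records in 10 groups of 10,000:
print(match_comparisons([10_000] * 10))   # 499,950,000 comparisons
```

The two calls cover the same total volume, yet the coarser grouping costs roughly 100 times more comparisons, which is why well-designed grouping plans are central to matching performance.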
Data Volumes
Data matching requirements and data volumes are closely related. As stated above, the time taken to execute a matching plan grows much faster than linearly with the volume of data records passed through it. Other types of plans do not exhibit this behavior, but the general rule still applies: the larger your data volumes, the longer plans take to execute.
Action: Although IDQ can handle data volumes measurable in eight figures, a dataset of
more than 1.5 million records is considered larger than average. If your dataset is measurable
in millions of records, and high levels of matching/de-duplication are required, consult the
Best Practice on Effective Data Matching Techniques.
Complexity and Quantity of Data Rules
This is a key factor in determining the complexity of your data quality stage. If the Data Quality Developer is likely to write a large number of business rules for the data quality plans (as may be the case if data quality target levels are very high or relate to precise data objectives), then the project is de facto moving beyond Data Cleanse and Match capability, and you need to add rule-creation and rule-review elements to the data quality effort.
Action: If the business requires multiple complex rules, you must scope additional time for rule creation and for multiple iterations of the data quality stage. Bear in mind that, in addition to being written and added to data quality plans, the rules must be tested and approved by the business.
Geography
Geography affects the project plan in two ways:
- First, the geographical spread of data sites is likely to affect the time needed to run plans, collate data, and engage with key business personnel. Working hours in different time zones can mean that one site is starting its business day while others are ending theirs, which can affect the tight scheduling of the simple data quality stage.
- Second, project data sourced from several countries typically means multiple data sources, with opportunities for data quality issues that may be specific to the country or to the division of the organization providing the data source.
There is also a high correlation between the scale of the data project and the scale of the
enterprise in which the project will take place. For multi-national corporations, there is rarely
such a thing as a small data project!
Action: Consider the geographical spread of your source data. If the data sites are spread
across several time zones or countries, you may need to factor in time lags to your data
quality planning.
