Sie sind auf Seite 1von 5

Data Profiling with Informatica Data Quality

By: Sally McCormack


One of the first steps in solving a data quality problem is to perform data profiling. As seen in
Jason Hovers article, Data Profiling: What, Why and How?, data profiling allows you to analyze
your data to determine what it looks like and what problems exist in the data. Manual data
profiling can be performed; however, using software such as Informatica Data Quality allows
both data stewards and developers to collaboratively profile the data in a common repository
more quickly, often yielding a more thorough analysis.
Informatica Data Quality offers profiling capabilities within both the Informatica Analyst and
Informatica Developer tools. Within Analyst, data stewards can quickly create a profile of the
data:

Data Profiling With Informatica Data Quality Image (A.1)

As seen in image (A.1), a data steward can see that there are 4 records with a NULL value in
column ADDR2. The data steward can also see that the values are mixed case. Here, the data
steward may determine that NULL values are not valid and that the values should be
standardized to uppercase. The values of San Diego, SD, SAN DIEGO should also all be SAN
DIEGO. The data steward can add this as a comment for the developer to create a rule (A.2).

Data Profiling With Informatica Data Quality Image (A.2)


In the Developer tool, the same profile can be viewed with the comments provided by the data
steward (A.3).

Data Profiling With Informatica Data Quality Image (A.3)


The developer can now apply profile rules to the ADDR2 column based on the business rules
provided. Applying profiling rules allows the data steward to identify possible inconsistencies in
the data and potentially correct the data in the source system. Note: The output of the profile rule
is a virtual column in the profile; the rule has not been applied to the data physically until a
mapping has been created in Developer.
Informatica Data Quality provides a number of pre-built rules, which can be applied directly to
the profile within either the Analyst or Developer tool. In this example, we will use the pre-built
rules of rule_Completeness and rule_UpperCase and create a new rule to standardize the values
for SAN DIEGO.
Before creating the rule to standardize values for SAN DIEGO, we need to create a list of valid
values. In Informatica Data Quality, the list of valid values can be created as a reference table.
The reference table will be used as a lookup to determine what value should be returned. TIP: A
quick way to create a reference table with valid values is to add the values from the data that has
been profiled. Right click on the value and select Add to Reference Table. (A.4).

Data Profiling With Informatica Data Quality Image (A.4)


Once the table is created, the list of valid values can be added (A.5). The ADDR2 column serves
as the valid value. If the value in column1, column2, or column3 exists, it will return the valid
value found in column ADDR2. The reference table can then be maintained through the Analyst
tool by the data steward (where an audit trail is captured).

Data Profiling With Informatica Data Quality Image (A.5)


TIP: If the data steward does not want to edit the table to include additional columns, the same
valid value in column ADDR2 can be added and the additional values can be listed in column1,
column2, column3 as highlighted in Image (A.6).

Data Profiling With Informatica Data Quality Image (A.6)


Now we can create a rule in the Developer tool to use the reference table. After applying the rule
in the profile, we can see that there are now 3 records with the value of SAN DIEGO (A.7).

Data Profiling With Informatica Data Quality Image (A.7)


Although we are using 3 different profile rules, we can take it one step further and combine the
rules into one profile rule. The rules are seen as mapplets in Developer. In this example, we will
create one mapplet named rule_ADDR2 that will include 3 mapplets (one for each rule
previously identified) (A.8).

Data Profiling With Informatica Data Quality Image (A.8)


Taking the output from rule_Completeness, we will determine if the record is valid (A.9).
Data Profiling With Informatica Data Quality Image (A.9)
TIP: Make sure to validate the mapplet as a Rule so that it will appear in the Analyst tool (A.10).

Data Profiling with Informatica Data Quality Chart (A.10)


After this rule is applied in the profile and re-run, the data steward can see how many records are
valid or invalid. The data steward can then drilldown on the invalid records.
Data Profiling with Informatica Data Quality Chart (A.11)
As seen in Image (A.11), there are 4 invalid records because they are NULL. The data steward
can then export this data, perform further analysis, update the source system, update the
reference table, and/or work with the developer to update the rule. Once profiling of the data is
complete and all rules have been identified/updated, then the rules can be physically applied to
the data and cleansed.
This data profiling example illustrates one of many techniques that can be used within
Informatica Data Quality. Collaboration amongst the data steward and developer during the data
profiling process are key in taking the first steps in solving a data quality problem.