
Data Profiling

The Basic Principles of Data Profiling

Enhancing BI-DW Capabilities


1) Data Quality
Data quality work lets you resolve data quality problems and achieve real,
sustainable improvements. This is an important step towards enhancing BI-DW capabilities.
2) Data Profiling
Data profiling helps you improve your data model and bring it to third normal form
(3NF). Like the data itself, data models are often imperfect and need improvement.
3) Data Mining
Data mining helps you find hidden patterns in your data that you are not yet
aware of. This step is meant to add new reports to the repertoire of standard
business reports.

Data Profiling Overview


Data profiling helps you create a data model in third normal form (3NF), based solely on
the data available in the source system.
To create a data model in third normal form, we need the following
information:
1) Domain - Column data type and length.
2) Dependency - Primary Key.
3) Relationship - Foreign Key.

Data profiling is divided into three steps: Single Column Profiling to get column domain
information, Table Structural Profiling to get dependency information, and Cross Table
Profiling to get relationship information.

Single Column Profiling


This gives you column domain information, which is used to determine the correct data type and
length of a particular column. For example, if the values in a column all have six digits and look like
040500, the data type could be either INTEGER or a DATE in mmddyy or ddmmyy format. [1]
Column profiling produces a list of inferred data types that fit the column data. Below are some of
the reports generated by the SSIS Data Profiling Task.
1. Column Length Distribution Profile: Reports all the distinct lengths of string values in the
selected column and the percentage of rows in the table that each length represents.
2. Column Statistics Profile: Reports statistics such as minimum, maximum, average, and standard
deviation for numeric columns, and minimum and maximum for datetime columns.
3. Column Null Ratio Profile: Reports the percentage of null values in the selected column.
4. Profile Time: Reports how much time it took to profile the sample data.
1. Informatica Data Explorer 8.6.2 User Guide
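
The column-level reports described above can be approximated in a few lines of Python. This is only a minimal sketch, not the SSIS or Informatica implementation: the function names are hypothetical, null values are represented as None, and two-digit years are assumed to mean 20yy.

```python
from collections import Counter
from datetime import date
from statistics import mean, pstdev


def length_distribution(values):
    """Percentage of non-null rows for each distinct string length."""
    non_null = [v for v in values if v is not None]
    counts = Counter(len(v) for v in non_null)
    return {length: 100.0 * n / len(non_null) for length, n in counts.items()}


def null_ratio(values):
    """Percentage of rows whose value is null (None)."""
    return 100.0 * sum(v is None for v in values) / len(values)


def numeric_stats(values):
    """Minimum, maximum, average and population standard deviation."""
    nums = [v for v in values if v is not None]
    return {"min": min(nums), "max": max(nums), "avg": mean(nums), "stdev": pstdev(nums)}


def _is_date(value, order):
    """True if a six-digit string is a valid date; order is 'mdy' or 'dmy'."""
    if len(value) != 6 or not value.isdecimal():
        return False
    parts = dict(zip(order, (int(value[0:2]), int(value[2:4]), int(value[4:6]))))
    try:
        # Assumption: two-digit years are read as 20yy.
        date(2000 + parts["y"], parts["m"], parts["d"])
    except ValueError:
        return False
    return True


def candidate_types(values):
    """Data types that fit every non-null value in the column."""
    non_null = [v for v in values if v is not None]
    fits = []
    if all(v.isdecimal() for v in non_null):
        fits.append("INTEGER")
    if all(_is_date(v, "mdy") for v in non_null):
        fits.append("DATE mmddyy")
    if all(_is_date(v, "dmy") for v in non_null):
        fits.append("DATE ddmmyy")
    return fits


# Illustrative column: six-digit strings with one null.
column = ["040500", "311299", None, "010100"]
print(null_ratio(column))
print(length_distribution(column))
print(candidate_types(column))
```

A value such as 040500 fits all three candidate types, which is why the inferred list usually needs a human decision; a single value such as 311299 is enough to rule out mmddyy.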

What is a Domain?
A simple example of a Domain is the list of United States state abbreviations. The
Domain could be implemented as a CHAR(2) and would contain the following valid
value set: AL, AK, AZ, AR, CA, CO, CT, DE, DC, FL, GA, HI, ID, IL, IN, IA, KS, KY, LA,
ME, MD, MA, MI, MN, MS, MO, MT, NE, NV, NH, NJ, NM, NY, NC, ND, OH, OK,
OR, PA, RI, SC, SD, TN, TX, UT, VT, VA, WA, WV, WI, WY. [2]
Many Columns can share the same Domain. Columns which share the same Domain may
be Synonym candidates.
A Domain is defined as the set of all valid values for a Column or set of Columns.
Domains contain target data type information, a user-defined list of valid values, and a
list of valid patterns. Each Schema has its own set of Domains.
2. Informatica Data Explorer 8.6.2 User Guide
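
To make the idea of a Domain concrete, the sketch below models one as a data type, a valid value set, and a valid pattern, then checks column values against it. The dictionary layout and function names are hypothetical and are not taken from Informatica Data Explorer; the value set is deliberately truncated to keep the example short.

```python
import re

# Hypothetical Domain for state abbreviations. The valid value set is cut
# down to a handful of entries to keep the sketch short.
STATE_DOMAIN = {
    "data_type": "CHAR(2)",
    "valid_values": {"AL", "AK", "CA", "NY", "TX", "WY"},
    "pattern": re.compile(r"[A-Z]{2}"),
}


def domain_violations(values, domain):
    """Return the values that break the Domain's pattern or valid value set."""
    return [
        v for v in values
        if not domain["pattern"].fullmatch(v) or v not in domain["valid_values"]
    ]


def share_domain(column_a, column_b, domain):
    """Two columns that both conform to one Domain may be Synonym candidates."""
    return not domain_violations(column_a, domain) and not domain_violations(column_b, domain)


print(domain_violations(["CA", "ZZ", "ca", "TX"], STATE_DOMAIN))
```

Sharing a Domain only makes two columns candidates for Synonyms; whether they carry the same business meaning still has to be confirmed by someone who knows the data.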

Table Structural Profiling


Table Structural profiling discovers functional dependencies, or just dependencies. It asks the
question: "If I know a value in one column (or the values in a set of columns), can I positively
determine the value in another column?"
EmployeeID  PAN        FirstName  MI  LastName
001         AJE11111N  Sunil      A   Singh
002         AJE22222N  Sunil      J   Sharma
003         AJE33333N  Rohit      B   Kumar
004         AJE44444N  Rohit      J   Kumar

If you ran a dependency profile for this Table, you would find the following dependencies, among others:
EmployeeID -> PAN
EmployeeID -> FirstName
EmployeeID -> LastName
EmployeeID -> MI
PAN -> EmployeeID
PAN -> FirstName
PAN -> LastName
PAN -> MI
LastName -> FirstName
FirstName + MI -> LastName
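
A dependency check of this kind can be sketched as follows: a set of determinant columns determines a dependent column if no determinant value ever appears with two different dependent values. The rows repeat the employee table above; the function name `holds` is hypothetical, and this brute-force check is an illustration rather than how any particular profiling tool works.

```python
ROWS = [
    {"EmployeeID": "001", "PAN": "AJE11111N", "FirstName": "Sunil", "MI": "A", "LastName": "Singh"},
    {"EmployeeID": "002", "PAN": "AJE22222N", "FirstName": "Sunil", "MI": "J", "LastName": "Sharma"},
    {"EmployeeID": "003", "PAN": "AJE33333N", "FirstName": "Rohit", "MI": "B", "LastName": "Kumar"},
    {"EmployeeID": "004", "PAN": "AJE44444N", "FirstName": "Rohit", "MI": "J", "LastName": "Kumar"},
]


def holds(rows, determinant, dependent):
    """True if the determinant columns functionally determine `dependent`."""
    seen = {}
    for row in rows:
        key = tuple(row[c] for c in determinant)
        # Remember the first dependent value seen for this key; any later
        # row with the same key but a different value breaks the dependency.
        if seen.setdefault(key, row[dependent]) != row[dependent]:
            return False
    return True


print(holds(ROWS, ["EmployeeID"], "PAN"))
print(holds(ROWS, ["FirstName", "MI"], "LastName"))
print(holds(ROWS, ["FirstName"], "LastName"))
```

Note that a dependency found this way is only true for the profiled sample; more rows can always disprove it.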

Table Structural Profiling (Continued)

The list on the previous slide represents true dependencies. Now let's look at the dependencies below.
FirstName -> LastName
FirstName + LastName -> PAN
The first one is not a true dependency because FirstName does not positively determine LastName: Sunil
could be Singh or Sharma. Similarly, FirstName + LastName does not uniquely determine PAN, because
Rohit Kumar appears with two different PANs.
If you add the first list to the Dependency Model, you would get two keys:
EmployeeID
PAN
However, only one of them can be the primary key; the other is called an alternate key. [3]
3. Informatica Data Explorer 8.6.2 User Guide
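
A column that determines every other column has no repeated values, so single-column keys can be found with a simple uniqueness test. The sketch below uses the same employee rows; the function names are hypothetical. It is restricted to single columns on purpose: in a sample this small, combinations such as FirstName + MI also happen to be unique, which is a coincidence of the data rather than a real key.

```python
ROWS = [
    {"EmployeeID": "001", "PAN": "AJE11111N", "FirstName": "Sunil", "MI": "A", "LastName": "Singh"},
    {"EmployeeID": "002", "PAN": "AJE22222N", "FirstName": "Sunil", "MI": "J", "LastName": "Sharma"},
    {"EmployeeID": "003", "PAN": "AJE33333N", "FirstName": "Rohit", "MI": "B", "LastName": "Kumar"},
    {"EmployeeID": "004", "PAN": "AJE44444N", "FirstName": "Rohit", "MI": "J", "LastName": "Kumar"},
]


def is_key(rows, columns):
    """True if the given columns take a distinct value combination in every row."""
    projected = {tuple(row[c] for c in columns) for row in rows}
    return len(projected) == len(rows)


def single_column_keys(rows):
    """Names of the columns that are unique on their own, in column order."""
    return [column for column in rows[0] if is_key(rows, [column])]


print(single_column_keys(ROWS))
```

Choosing which key becomes the primary key and which stays an alternate key is a modelling decision that the profile itself cannot make.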

Cross Table Profiling


Cross Table profiling compares Columns in a Schema, determining which ones contain similar
values. This profile can determine whether a column or set of columns is appropriate to serve as a
foreign key between the selected tables.
Cross Table profiling can find the following types of redundancies:
1) Redundant data to eliminate by creating Synonyms.
2) Redundant data that is intentionally redundant to improve database performance. You may still
want to make these Column pairs Synonyms so that the normalizer can create a true third normal
form (3NF).
3) Data that looks redundant but actually represents different business facts (homonyms).
For example, the Columns Employee_Age and Quantity_On_Hand may appear as a pair of
redundant Columns if both contain integer values under 100. Although the values in these Columns
are similar, the Columns actually have very different business meanings. [4]
4. Informatica Data Explorer 8.6.2 User Guide
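
One simple way to compare columns across tables is to measure how many of one column's distinct values also occur in the other. The sketch below is an assumption-laden illustration, not the algorithm of any specific tool: the function names and sample values are hypothetical, and a column is treated as a foreign key candidate when all of its values occur in a parent column whose values are unique.

```python
def containment(child_values, parent_values):
    """Fraction of distinct child values that also occur in the parent column."""
    child, parent = set(child_values), set(parent_values)
    return len(child & parent) / len(child) if child else 0.0


def is_foreign_key_candidate(child_values, parent_values):
    """True if the parent column is unique and contains every child value."""
    parent = list(parent_values)
    parent_is_unique = len(set(parent)) == len(parent)
    return parent_is_unique and containment(child_values, parent) == 1.0


# Hypothetical sample columns.
employee_ids = ["001", "002", "003"]
order_employee_ids = ["001", "002", "002"]
print(is_foreign_key_candidate(order_employee_ids, employee_ids))

# Homonym trap: ages and stock quantities overlap fully by value,
# yet they describe unrelated business facts.
employee_age = [25, 40, 40, 31]
quantity_on_hand = [25, 31, 40, 7]
print(containment(employee_age, quantity_on_hand))
```

A high overlap score only flags a pair for review; as the Employee_Age and Quantity_On_Hand case shows, the final call depends on business meaning, not on values alone.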

Synonyms
Two or more Columns that have the same business meaning are called Synonyms. Suppose a Schema
contains the following two Tables:
EmployeeID  Name           MgrID  Hiredate
001         Bob Smith      002    01-15-00
002         Jane Goodall   005    06-12-99
003         Royal Robbins  002    09-08-99

EmpID  Salary  Stock Options
001    25,000  100
002    50,000  500
003    25,000  100

Both relations contain employee data, but they are defined separately to segregate public and private
information. The EmpID and EmployeeID Columns have the same business meaning and can be
meaningfully combined into a single Column. In contrast, look at how the MgrID column is used in
the Employee Table. Even though MgrID uses similar values to EmployeeID, it represents a different
role in the database. Therefore you would not define MgrID and EmployeeID as Synonyms.

Synonyms (Continued)
Normalization has the following impacts on Synonyms:
1) If two or more Columns made Synonyms represent the identical construct, they will collapse into
one Column in the normalized model.
2) If two Columns made Synonyms represent a parent-child relationship, they will result in two
Columns in two Tables, with one Column participating in a Primary Key and the other in the
corresponding Foreign Key. [5]
5. Informatica Data Explorer 8.6.2 User Guide

Data Profiling Tools


1) Informatica Data Explorer 8.x
2) Informatica PowerCenter 8.x (Profiling option in Source Analyzer)
3) Oracle Warehouse Builder 10g (Data Profiling node in the Project Explorer)
4) SQL Server Integration Services (Data Profiling Task)
5) IBM InfoSphere (Information Analyzer)

Q&A
