
What is DataFlux?

So what is DataFlux? A leader in data quality, it’s both a company and a product; better
stated, DataFlux (the company) provides a suite of tools (often simply called DataFlux) that deliver
data management capabilities with a focus on data quality. DataFlux’s tools can do a lot of really
neat things; I’d say the suite is a must-have for Sales & Marketing, and it would benefit most
enterprises in other ways too. To see what all the pomp is about, let’s use an example. Think of
these entries in your company’s xyz system:

Name                    Address                      City, State, Zip             Phone
Mr. Victor Johnson      1600 Pennsylvania Avenue NW  Washington, DC 20500         202-456-1414
Victor Jonson, JD       1600 Pennsylvania Avenue     Washington, DC               456-1414
VICTOR JOHNSON          255 DONNA WAY                SAN LUIS OBISPO, CA 93405    (805) 555-1212
Bill Shares             1050 Monterey St             SLO, CA 93408                8055444800
Doctor William Shares   1052 Monterrey Ave           San Luis Obispo, California  n/a
william shares, sr      1001 cass street             omaha, nebraska, 68102

In this example, a human could fairly easily figure out that the first two Victors are
probably one and the same, and that Bill in SLO and William in San Luis Obispo are also the same
person. The other records might be a match, but most of us would agree that we can’t be sure
based on name alone. Furthermore, some data inconsistencies are obvious: name prefixes and
suffixes, inconsistent casing, incomplete address data, etc. DataFlux can’t (and shouldn’t try to)
fix all of these quirks, but it should at least be able to reconcile the differences, and, if we
choose, we should be able to do some of the data cleanup automatically. So let’s get started.
I’ll open up dfPower Studio.
This interface is new in version 8 and provides quick access to the functions one would use
most often. Unlike some GUI changes made by other companies, this one is actually helpful: it
brings a lot of the settings together in a central place.

In my case, I’ll start Architect, where most design takes place, by clicking on the icon in the top
left. On this note, I should say that Architect is the single most useful product in the suite (in my
opinion, anyway), and it’s where I’ll spend most of my time in this posting.
On the left panel you’ll see a few categories. Let me explain what you’ll find in each one (skip over
this next section if you want):

Data Inputs – Here you’ll find nodes allowing you to read from ODBC sources, text files, SAS data
sets (DataFlux is a SAS company), and more. I’ll cover one other data input later…

Data Outputs – Similar to inputs, you’ll find various ways of storing the output of the job.

Utilities – Utilities contain what many would refer to as “transformations”, which might be a helpful
analogy if you’ve worked with Informatica or another ETL (Extract, Transform, Load) tool.

Profiling – Most nodes here provide a synopsis of the data being processed. Another
DataFlux tool is dedicated to profiling; in some ways these nodes are a subset of that tool’s
functionality, but with one primary difference: here the output of profiling can be linked to other
actions.

Quality – Here’s where some of DataFlux’s real magic takes place, so I’ll briefly describe each
node: Gender Analysis (determines gender based on a name field), Identification Analysis (e.g.
is this a person’s name or an organization’s name?), Parsing (we’ll see this), Standardization
(we’ll see one application of this), Change Case (generally not too complicated, but it gets tricky
with certain alphabets), Right Fielding (moves data from the “wrong” field to the “right” one),
Create Scheme (new in version 8 – more of an advanced topic), and Dynamic Scheme
Application (new in version 8 – another advanced topic).

Integration – Another area where magic takes place. We’ll see this in this post.

Enrichment – As the name suggests, these nodes help enrich data, i.e. they supply data that’s
missing from the input. This section includes address verification (we’ll see this), geocoding
(deriving geographic coordinates and related demographic information from an address), and
some phone number functions (we’ll see one example).

Enrichment (Distributed) – Provides the same functionality as I just described, but distributed
across servers for performance/reliability gains.

Monitoring – Allows an action to be taken on a data trigger, e.g. email John if sales fall under
$10K.
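Conceptually, a monitoring rule is just a condition plus an action. Here’s a minimal Python sketch of that idea (the rule shape and the action strings are my own invention for illustration; DataFlux’s monitoring is configured in its GUI, not in code):

```python
def check_triggers(metrics, rules):
    """Fire every rule whose metric falls below its threshold."""
    fired = []
    for metric, threshold, action in rules:
        if metrics.get(metric, 0) < threshold:
            fired.append(action)  # e.g. hand off to an email sender
    return fired

# "email John if sales fall under $10K"
check_triggers({"sales": 9500}, [("sales", 10_000, "email John")])  # -> ["email John"]
```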
Now that we’ve gone through a quick overview of Architect’s features, let’s use them. I’ll first drag
my data source onto the page and double-click on it to configure its properties. For my purposes
today, I’ll read from a delimited text file I created with the data described at the beginning of the
article. I can use the “Suggest” button to populate the field names from the header of the text
file.
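If you want to follow along without Architect, the same sample data can be read from a delimited file with plain Python (the pipe delimiter and column names below are my own choice, not anything DataFlux mandates):

```python
import csv
import io

# Two rows of the sample data from the table above, pipe-delimited, with
# field names taken from the header row (like Architect's "Suggest" button).
SAMPLE = """\
Name|Address|City|State|Zip|Phone
Mr. Victor Johnson|1600 Pennsylvania Avenue NW|Washington|DC|20500|202-456-1414
Bill Shares|1050 Monterey St|SLO|CA|93408|8055444800
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE), delimiter="|"))
rows[0]["Name"]  # -> "Mr. Victor Johnson"
```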

What’s nice here is that I can leave auto-preview on (which, by the way, drives me crazy), or I can
turn it off and press F5 to refresh, which shows the data only when asked. Either way, the data
appears in my preview window (instant gratification is one of the great things about Architect).
Next, I’ll start today’s data quality work by verifying these addresses. I do this by dragging on the
Address Verification (US/Canada) node. After attaching the node to Text File Input 1 and double-
clicking on it, I map my fields in the input section to the ones expected by DataFlux, and in
another window I specify which outputs I’m interested in. I’ve selected a few fields here, but there
are many other options available.
You’ll notice I’ve passed through only the enriched address fields in the output. I could have
also kept the originals side by side, and I could have added many more fields to the output, but
these will suffice for now (it’d be tough to fit more on the screen here). Already you can see what a
difference we’ve made. I want to point out just two things here:

1. There is one “NOMATCH”. This likely happened because too many fields are wrong, and
the USPS data verification system is designed not to guess too much…

2. 1052 Monterey St is an address I made up, and consequently the ZIP+4 could not be determined.
The real address of the courthouse in San Luis Obispo is 1050 Monterey St; if I had used
that, the correct ZIP+4 would have been calculated. So why did we get a US_Result_Code of “OK”?
Because the USPS system recognizes 1052 as a street number within a valid range for that street.
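That “OK but no ZIP+4” behavior is easy to see with a toy model of range-based address matching. Everything below (the range table, the parity rule, the ZIP+4 value) is invented for illustration; real CASS-style verification runs against licensed USPS data, not a hand-rolled table like this:

```python
# Toy street-range table: (street, low, high, parity, ZIP+4 by exact number).
# All values here are invented for illustration.
RANGES = [
    ("MONTEREY ST", 1000, 1098, "even", {1050: "2405"}),
]

def verify(street_name, number):
    for name, low, high, parity, zip4s in RANGES:
        in_range = low <= number <= high
        parity_ok = (number % 2 == 0) == (parity == "even")
        if name == street_name.upper() and in_range and parity_ok:
            # The block range matches, so the result code is "OK" even when
            # the exact house number is unknown and no ZIP+4 can be assigned.
            return "OK", zip4s.get(number)
    return "NOMATCH", None

verify("Monterey St", 1052)  # -> ("OK", None): in range, but no ZIP+4
```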

Nonetheless, pretty neat, eh? I’d also like to point out that the county name was determined
because I added that output when I configured the properties. At our company we’ve configured
DataFlux to comply with USPS Publication 28, which, among other things, indicates that addresses
should always be uppercased; that’s why the output appears in uppercase here. That said, you
have the option to propercase the result set if you’d like.

Moving on, let’s clean up the names. It’d be nice if we could split each name into a first and last
name. First, I reconfigured the USPS properties to allow additional outputs (the original name and
phone number). Next, I dragged the Parsing node onto the screen and configured its properties to
specify which language and country the text is based on (DataFlux supports several locales, and
version 8 supports Unicode). After that, I can preview as before. Note how well DataFlux picked out
the first, middle, and last names, not to mention the prefixes and suffixes.
For simplicity, I’ll remove the Parse step I just added and use a Standardize node instead. Here in
the properties I’ll select a “Definition” for the name and phone inputs. There are many options to
choose from including things like: Address, Business Title, City, Country, Name, Date, Organization,
Phone, Postal Code, Zip, and several others. Let’s see what this does…
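To give a feel for what a standardization definition does, here is a toy Python stand-in for a “Phone” definition. The formatting rule is my own simplification; DataFlux’s real definitions live in the Quality Knowledge Base and handle far more cases:

```python
import re

def standardize_phone(raw):
    """Strip punctuation and reformat complete 10-digit US numbers."""
    digits = re.sub(r"\D", "", raw)
    if len(digits) == 10:
        return f"({digits[:3]}) {digits[3:6]}-{digits[6:]}"
    return raw  # leave incomplete numbers (e.g. "456-1414") as-is

standardize_phone("8055444800")  # -> "(805) 544-4800"
```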
You might be wondering how DataFlux does this. After all, if the input name were “Johnson, Victor”,
would it have correctly standardized the name to “Victor Johnson”? The answer is yes.
DataFlux uses several algorithms plus vocabularies of known last names, first names, etc. to
analyze the structure and provide a best “guess.” Of course, this means that with very unusual
names the parsing algorithm could make a mistake; nonetheless, I think most users would be
surprised how good this “guessing” can be, especially with the help of a comma. By that I mean
that the placement of a comma in a name greatly enhances the parser’s ability to determine the
location of the last name. If you’re interested in learning more about this, let me know and perhaps
I’ll write another blog post to go into the details. All in all, it’s pretty neat stuff, and the good part is
that it’s customizable, which helps if someday you want to write a standardization rule for your
company’s specific purpose.
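The comma heuristic is easy to sketch. In the toy parser below, a comma followed by a known suffix is treated as a suffix separator, while any other comma marks a family-name-first layout. The prefix and suffix lists are tiny stand-ins for the large vocabularies a real parser consults:

```python
KNOWN_PREFIXES = {"mr", "mrs", "ms", "dr", "doctor"}
KNOWN_SUFFIXES = {"jr", "sr", "jd", "md", "ii", "iii"}

def parse_name(raw):
    clean = lambda t: t.strip(".,").lower()
    parts = [p.strip() for p in raw.split(",")]
    if len(parts) == 2 and clean(parts[1]) not in KNOWN_SUFFIXES:
        # "Johnson, Victor": the comma marks a family-name-first layout
        tokens = parts[1].split() + parts[0].split()
    else:
        # "Victor Jonson, JD": the comma just sets off a suffix
        tokens = raw.replace(",", " ").split()
    prefix = [t for t in tokens if clean(t) in KNOWN_PREFIXES]
    suffix = [t for t in tokens if clean(t) in KNOWN_SUFFIXES]
    core = [t for t in tokens
            if clean(t) not in KNOWN_PREFIXES | KNOWN_SUFFIXES] or [""]
    return {"prefix": " ".join(prefix), "first": core[0],
            "middle": " ".join(core[1:-1]) if len(core) > 2 else "",
            "last": core[-1] if len(core) > 1 else "",
            "suffix": " ".join(suffix)}

parse_name("Johnson, Victor")  # -> first "Victor", last "Johnson"
```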

Let’s move on. Next I’m going to make “Match Codes.” Match codes allow duplicate identification
(and resolution). Often (perhaps most of the time), nothing can be done about data in a system
once it has been entered. For example, if a name is Rob, we can’t assume the real name is Robert,
yet we may have a burning desire to do something like that in order to figure out that one record is
a potential duplicate of another… this is where match codes come in. Here’s the section of the
Match Codes Properties window where we assign the incoming fields to a Definition. This step is
important because intelligent parsing, name lookups, etc. occur based on the data type.
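To make the idea concrete, here’s a toy match-code generator in Python. It normalizes tokens, expands a few nicknames, drops titles and suffixes, and applies a simplified Soundex so that minor misspellings collapse to the same code. The nickname table and stop-word list are invented, and DataFlux’s real match-code definitions are far more sophisticated than this:

```python
import re

# Tiny invented stand-ins for a real nickname vocabulary and title list.
NICKNAMES = {"bill": "william", "bob": "robert", "vic": "victor"}
DROP = {"mr", "mrs", "ms", "dr", "doctor", "jr", "sr", "jd"}

def soundex(word):
    """Simplified Soundex: first letter plus up to three consonant codes."""
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    out, prev = word[0].upper(), codes.get(word[0], "")
    for ch in word[1:]:
        code = codes.get(ch, "")
        if code and code != prev:
            out += code
        prev = code
    return (out + "000")[:4]

def match_code(name):
    tokens = [NICKNAMES.get(t, t)
              for t in re.findall(r"[a-z]+", name.lower()) if t not in DROP]
    return "$".join(sorted(soundex(t) for t in tokens))

match_code("Mr. Victor Johnson")  # same code as match_code("Victor Jonson, JD")
```

Note how “Johnson” and “Jonson” produce the same phonetic key, and “Bill” expands to “William”; that is the property that makes fast, exact-match duplicate lookups possible.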

Let’s preview a match code to see what this does.


I couldn’t fit the whole output on the screen here, but I think the match codes for the
name and the address will get my point across. You can see that match codes ignore minor
spelling differences and take abbreviations, nicknames, etc. into account. Why is this so significant?
We now have an easy way to find duplicates! Match codes could be stored in a database to allow
quick duplicate checks! Let’s move on to see more… I’m now going to use Clustering to
see how duplicate identification can be done. First, I’ll set the clustering rules in the Properties
window (note that I use the match code instead of the actual value for the rule):

And let’s preview…


Note that the cluster numbers are the same for records that match, based on the clustering
conditions I set a moment ago. Pay special attention to the fact that our Bill & William Shares didn’t
match. Why? Because of the clustering conditions I set. We could modify our Quality
Knowledge Base (QKB) to indicate that SLO = San Luis Obispo, or I could remove City as a
clustering condition and lower the sensitivity on the address match code (sensitivities
range from 50 to 95), and the two would match. Let’s do this to be sure:
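The effect can be mimicked with a toy clustering pass: records sharing a cluster key get the same cluster number, and lowering the sensitivity is modeled here as comparing a shorter prefix of the address match code. The match-code strings below are invented placeholders, not real DataFlux output:

```python
def cluster(records, addr_sensitivity):
    """Assign one cluster id per (name code, truncated address code) key."""
    clusters, ids = {}, []
    for name_code, addr_code in records:
        key = (name_code, addr_code[:addr_sensitivity])
        ids.append(clusters.setdefault(key, len(clusters) + 1))
    return ids

records = [
    ("S620$W450", "M536$1050"),  # Bill Shares, 1050 Monterey St
    ("S620$W450", "M536$1052"),  # Doctor William Shares, 1052 Monterrey Ave
]
cluster(records, 4)  # -> [1, 1]: low sensitivity, the two cluster together
cluster(records, 9)  # -> [1, 2]: full codes differ in the street number
```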

There are a lot of really neat things that DataFlux can do. I’ll try to post a thing or two out here now
and again if I see anyone interested…