Beruflich Dokumente
Kultur Dokumente
Module Objectives
After completing this topic, you should be able to explain:
■ Virtual MDM Algorithms Configuration
2
© 2014 IBM Corporation
Information Management
Reiterate
Reiterate
DATA
DATA
Jim Smith SMITH:JIM JN + SNT Jim Smith vs. Jim Smith = 6.1
Jim Smith vs. James Smith = 5.9
Jim Smith vs. John Smith = 4.2
Fields: LastName, FirstName, Middle, Suffix, NATLID, Gender, ZipCode, StreetLine1, HomePhone, MobilePhone, EyeColor
12
© 2014 IBM Corporation
Information Management
Standardization functions
The Expanded Address function converts strings to UPPER CASE, isolates each distinct String and
Number, and identifies them with an “S-“ or “N-“. These elements are delimited by a colon (:) and
common words like Street, Apartment, and North are abbreviated to “ST”, “APT”, and “N”.
Standardized Addresses
Address 1 N-208:S-W:S-CHICAGO:S-APT:N-3:S-OAK:S-PARK:S-IL:N-60302
Address 2 N-208:S-W:S-CHICAGO:S-AVE:S-APT:N-3:S-OAK:S-PARK:S-IL:N-60302
Address 3 N-208:S-W:S-CHICAGO:S-APT:N-3:S-OAK:S-PARK:S-IL:N-60302
Address 4 N-3:N-208:S-W:S-CHICAGO:S-AVE:S-OAK:S-PARK:S-IL:N-60302
14
© 2014 IBM Corporation
Information Management
What is bucketing?
Bucketing speeds
searching by enabling
Buckets act like a table index,
sub-set search except they reference data in
multiple tables
Organizing buckets
Bucketing defines how a record is organized, each record may have several buckets
Organizing buckets
Bucketing defines how a record is organized, each record may have several buckets
PHONE NATLID
JILL+MONTAUK JOHN+MONTAUK PAULA+MONTAUK
0236779 012337789
6394223249508127562 198205+ALKSNT DOB is added to a phonetic form of Alexander (the formal name for Alec)
8128495530969973259 85038+SPKL A sorted Zip Code (85038 > 03588) is added to phonetic form of Speegle
1 2 3 4 5
1, 3, 5 4 2 1, 3, 5 1, 4
23
© 2014 IBM Corporation
Information Management
Comparison methodologies
There are several forms of comparisons:
– Exact Match: Do the two values match, Yes or No?
– Starts With: Do the two values start with the same initial or digits?
– Edit Distance: How many edits does it take to make the values match?
– Phonetics: Do the values have the same key sound markers?
– Equivalency: Could one of the values be a nickname or alias?
These techniques are bundled into a set of functions that blend approaches
– Name (comprehensive): Approach uses all five methods, best match wins
– Address & Phone: Uses all methods for address, but edit distance for
phone
– Date: Approach blends exact match and edit distance
Comparison functions
There are several comparison function categories:
– String Pair
– Address & Phone
– Name
– Date
– Edit Distance
– Equivalency
– False Positive Filter
– Geocode
– Height & Weight
– US Zipcode
Edit distance
Edit distance is measures of similarity between two strings by counting the
number of edits needed to make the strings match
There are four kinds of edits:
– Transposition: Two adjacent characters that are swapped (STEIN vs.
STIEN)
– Insertion: Need to add another character (MARGRET vs. MARGARET)
– Deletion: Need to remove and extra character (662071 vs. 66271)
– Replacement: Swap out a character with another one (GORDON vs.
GORTON)
Distance Meaning
0 Exact Match
1 One edit off
2 Two edits off
>12 Rarely calculated beyond 12
30 © 2014 IBM Corporation
Information Management
372817271^M^SMITH:JOHN:H:SR:.^…
Standardization:
Data is pulled from the attribute, formatted for more
efficient comparison, and stored in the MEMCMPD
table according to the role number.
35
© 2014 IBM Corporation
Information Management
Standardization:
Data is pulled from the attribute, formatted for more
efficient comparison, and stored in the MEMCMPD
table according to the role number.
Bucketing:
The bucket function pulls data from MEMCMPD table, transforms it according to the bucketing
function design (As Is, Sorted, YYYYMM, Equivalence & Phonetic), then sends data to the
bucketing group, which creates the actual buckets and stores them in the MEMBKTD table.
36
© 2014 IBM Corporation
Information Management
Standardization:
Data is pulled from the attribute, formatted for more
efficient comparison, and stored in the MEMCMPD
table according to the role number.
Bucketing:
The bucket function pulls data from MEMCMPD table, transforms it according to the bucketing
function design (As Is, Sorted, YYYYMM, Equivalence & Phonetic), then sends data to the
bucketing group, which creates the actual buckets and stores them in the MEMBKTD table.
Comparison:
The comparison function queries through MEMBKTD to locate a list of candidates to consider
for matching.
37
© 2014 IBM Corporation
Information Management
17281 = 9.1
Standardization: 2772 = 8.2
Data is pulled from the attribute, formatted for more 27189 = 5.1
efficient comparison, and stored in the MEMCMPD 392 = 2.1
table according to the role number.
2686 = 3.3
Bucketing:
The bucket function pulls data from MEMCMPD table, transforms it according to the bucketing
function design (As Is, Sorted, YYYYMM, Equivalence & Phonetic), then sends data to the
bucketing group, which creates the actual buckets and stores them in the MEMBKTD table.
Comparison:
The comparison function queries through MEMBKTD to locate a list of candidates to consider
for matching. The MEMCMPD entries for those candidates are them compared and a score is
issued for each record. The scores are sent to the entity manager to decided the next step or
they are sent to the search results to be displayed in descending order.
39
© 2014 IBM Corporation
Information Management
Click to pick up