Detection and Elimination of Duplicate Data Using Token-Based Method
J. Jebamalar Tamilselvi¹ and V. Saravanan²

¹PhD Research Scholar, Department of Computer Application, Karunya University,
Coimbatore – 641 114, Tamilnadu, INDIA
E-mail: jjebamalar@gmail.com

²Professor & HOD, Department of Computer Application, Karunya University,
Coimbatore – 641 114, Tamilnadu, INDIA
E-mail: saravanan@karunya.edu
Abstract
Finally, all the cleaned records are grouped or merged and made available for
the next process.
This research work is efficient in reducing the number of false
positives without missing true duplicates. Compared with previous approaches, the
token concept is included in this new framework to speed up the data cleaning
process and reduce its complexity. Several blocking keys are analysed through
extensive experiments to select the best blocking key for bringing similar records
together, so that not all pairs of records need to be compared. A rule-based
approach is used to identify exact and inexact duplicates and to eliminate them.
Introduction
In the 1990s, as organizations of scale began to need more timely data for their
business, they found that traditional information systems technology was simply too
cumbersome to provide relevant data efficiently and quickly. Completing reporting
requests could take days or weeks using antiquated reporting tools that were designed
more or less to 'execute' the business rather than 'run' the business.
A data warehouse is basically a database, and unintentional duplication of
records created from the millions of records loaded from other sources can hardly be
avoided. In the data warehousing community, the task of finding duplicated records
within a data warehouse has long been a persistent problem and has become an area
of active research. There have been many research undertakings to address the
problems caused by duplicate contamination of data.
There are two issues to be considered for duplicate detection: Accuracy and
Speed. The measure of accuracy in duplicate detection depends on the number of false
negatives (duplicates that were not classified as such) and false positives (non-
duplicates which were classified as duplicates). The algorithm’s speed is mainly
affected by the number of records compared, and how costly these comparisons are.
Generally, CPUs are not able to perform duplicate detection on large databases within
any reasonable time, so the number of record comparisons normally needs to be cut
down [4].
In this research work, a framework is developed to handle any duplicate data in a
data warehouse. The main objective of this research work is to improve data quality
and increase speed of the data cleaning process. A high quality, scalable blocking
algorithm, similarity computation algorithm and duplicate elimination algorithm are
used and evaluated on real datasets from an operational data warehouse to achieve the
objective.
Framework Design
A sequential framework is developed for the detection and elimination of duplicate
data. This framework comprises some existing data cleaning approaches and new
approaches, which are used to reduce the complexity of duplicate data detection and
elimination and to clean with more flexibility and less effort.
Fig. 1 shows the framework to clean duplicate data in a sequential order. Each
step of the framework is well suited to a different purpose. The framework adapts to
the data by using a software agent in each step, with little user interaction.
The principle of this framework is as follows:
A. Selection of attributes: There is a clear need to identify and select attributes.
These selected attributes are to be used in the other steps.
B. Formation of tokens: A well-suited token is created to check the
similarities between records as well as fields.
C. Clustering/Blocking of records: A clustering algorithm or blocking method is
used to group the records based on the similarity of the block-token-key value.
D. Similarity computation for selected attributes: The Jaccard similarity method
is used for token-based similarity computation (a sketch follows this list).
E. Detection and elimination of duplicate records: A rule-based detection
and elimination approach is used for detecting and eliminating the duplicates.
F. Merge: The result or cleaned data is merged.
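For step D, the Jaccard coefficient over token sets is the similarity measure. As a rough illustration only (the function name and example values below are assumptions, not the paper's implementation), a token-based Jaccard comparison could look like this in Python:

def jaccard_similarity(tokens_a, tokens_b):
    # Jaccard coefficient of two token sets: |A intersect B| / |A union B|
    set_a, set_b = set(tokens_a), set(tokens_b)
    if not set_a and not set_b:
        return 1.0  # treat two empty token sets as identical
    return len(set_a & set_b) / len(set_a | set_b)

# Example: token records formed from the selected attributes
record_1 = ["JOHN", "SMITH", "641114"]
record_2 = ["JON", "SMITH", "641114"]
print(jaccard_similarity(record_1, record_2))  # 0.5, compared against a threshold

Record pairs whose score exceeds a chosen threshold would then be passed to the rule-based detection and elimination step (E).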
A. Selection of attributes
The data cleaning process is complex with the large amount of data in the data
warehouse. Attribute selection is very important when comparing two records [5], and
it reduces the time and effort for the subsequent work, such as the record similarity
and elimination processes. This step is the foundation for all the remaining steps.
Therefore, time and effort are two important requirements for promptly and
qualitatively selecting the attributes to be considered.
[Figure 1: Sequential data cleaning framework, from new data and the data warehouse through (A) selection of attributes, (B) token formation, (C) clustering/blocking, (D) similarity computation for the selected attributes using a bank of similarity functions, (E) elimination using a bank of elimination functions, and (F) merge, with LOG tables maintained at each step to produce the cleaned data.]
Evaluation of attributes: the per-attribute accuracy and consistency are computed as

b. $\mathrm{Accuracy}_j = \frac{\sum_{i=0}^{N} \mathrm{accuracy}_{i,j}}{n}$

c. $\mathrm{Consistency}_j = \frac{\sum_{i=0}^{N} \mathrm{consistency}_{i,j}}{n}$
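The per-record scores accuracy_{i,j} and consistency_{i,j} are not defined in this excerpt; the short sketch below simply assumes they are 0/1 indicators recorded per field value and averages them per attribute, as in the formulas above (an illustration, not the authors' evaluation code):

def attribute_score(indicator_matrix, j):
    # Average a per-record 0/1 indicator (accuracy or consistency)
    # over all records for attribute column j.
    n = len(indicator_matrix)
    return sum(row[j] for row in indicator_matrix) / n

# indicator_matrix[i][j] = 1 if attribute j of record i is accurate, else 0
accuracy = [[1, 1], [1, 0], [1, 1]]
print(attribute_score(accuracy, 0))  # 1.0
print(attribute_score(accuracy, 1))  # 0.666...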
B. Formation of Tokens
This step makes use of the selected attribute field values to form a token. Tokens
can be created for a single attribute field value or for combined attributes. For
example, the contact name attribute is selected to create a token for the further
cleaning process. The contact name attribute is split into first name, middle name
and last name; here, first name and last name are combined as the contact name to
form a token.
Tokens are formed from numeric values, alphanumeric values and alphabetic
values by selecting some combination of characters. Unimportant elements are
removed before token formation (title tokens like Mr., Dr. and so on) [6].
Numeric tokens comprise only digits [0–9]. Alphabetic tokens consist of letters
(a–z, A–Z): the first character of each word in the field is considered and the
characters are sorted. Alphanumeric tokens comprise both numeric and alphabetic
tokens; a given alphanumeric element is decomposed into its numeric and alphabetic
parts [7].
This step eliminates the need to use the entire string records, with multiple passes,
for duplicate identification. It also addresses the similarity computation problem in
a large database by forming a token key from selected fields, which reduces the
number of comparisons.
Fig. 4 shows an algorithm for token formation. In this algorithm, rules are
specified to form the tokens, and the algorithm works according to the type of the
data. For example, if the address attribute is selected, the alphanumeric rule is used
to form the tokens. The formed tokens are stored in the LOG table.
The idea behind this algorithm is to define smart tokens from the fields of the
selected most important attributes by applying simple rules for numeric, alphabetic,
and alphanumeric tokens. The temporary table then consists of smart token
records, composed from the field tokens of the records. These smart token records are
sorted using the block-token-key.
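The Fig. 4 algorithm itself is not reproduced here. As a minimal sketch of how such rule-based token formation could be implemented under the rules described above (the title list, regular expressions and helper names are assumptions, not the authors' code):

import re

TITLES = {"MR", "MRS", "MS", "DR", "PROF"}  # unimportant title elements to drop [6]

def form_token(value):
    # Form a smart token from a field value: drop titles, keep digit runs
    # as the numeric part, and use the sorted first letters of the words
    # as the alphabetic part; alphanumeric fields yield both parts.
    words = [w for w in re.split(r"\W+", str(value).upper()) if w]
    words = [w for w in words if w not in TITLES]
    numeric_parts = [p for w in words for p in re.findall(r"\d+", w)]
    alpha_initials = sorted(w[0] for w in words if w[0].isalpha())
    return "".join(numeric_parts) + "".join(alpha_initials)

print(form_token("Mr. John K. Smith"))   # -> "JKS"
print(form_token("12/4 Gandhi Street"))  # -> "124GS"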
C. Clustering/Blocking of records
Data mining primarily works with large databases. Sorting large datasets and running
the duplicate elimination process over such a database faces scalability problems.
Clustering techniques are used to cluster or group the dataset into small groups,
based on distance values or some threshold values, to reduce the time taken by the
elimination process.
Blocking methods are used to reduce the huge number of comparisons. This
step groups the records that are most likely to be duplicates based on the
similarity of the block-token-key; i.e., the block-token-key is used to split the data
sets into blocks, and it is therefore central to blocking the records. Four types of
block-token-key generation are used, and a good block-token-key is identified for
blocking: i) blocking with a single attribute, ii) blocking with multiple attributes,
iii) array-based block-token-key, and iv) token-based blocking key. The choice of a
good block-token-key can greatly reduce the
number of record-pair evaluations to be performed, so that the user can achieve
significant results.
Output: Blocked records
Var: n – no. of records, block b[ ], i, j
begin
  1. Sort database using Block-Token-Key
  2. Initialize new block
  3. Blocking Records
     i. Single Attribute
        for each record i = 0 to n
          for each key j = 1 to kn
            for each block k = 1 to bn
              if distance(key[j], key[j + 1]) > threshold
                add key[j] to block b[k]
              else
                add key[j + 1] to b1
                initialize new block b[k + 1]
              end if
              j = j + 1
          end for
        end
     ii. Multiple Attribute
        for each record i = 0 to rn
          for each column j = 0 to cn
            for each key k = 1 to kn
              for each block b = 1 to bn
                calculate distance of all selected column key values d(values)
                if distance(d[values]) > threshold
                  add key[j] to block b[k]
                else
                  add key[j + 1] to b1
                  initialize new block b[k + 1]
                end if
                j = j + 1
            end for
        end
This algorithm efficiently identifies exact and inexact duplicate records based
on the selection of the block-token-key. The number of comparisons is reduced when
compared with other existing methods, which increases the speed of the data cleaning
process. This blocking method gives better performance than other methods, such as
shrinking or expanding the window size, based on the block-token-key.
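As an illustration of the single-attribute case, the following Python sketch sorts the records on the block-token-key and keeps adjacent records in the same block while their keys are sufficiently similar, starting a new block otherwise. The similarity function, the threshold and the record layout are assumptions made for the example, not the paper's exact distance test:

from difflib import SequenceMatcher

def similarity(a, b):
    # Simple string similarity in [0, 1]; stands in for the
    # distance/threshold test in the pseudocode above.
    return SequenceMatcher(None, a, b).ratio()

def block_records(records, key, threshold=0.8):
    # Sort records on the block-token-key and group adjacent records
    # whose keys are similar enough into the same block.
    if not records:
        return []
    records = sorted(records, key=lambda r: r[key])
    blocks, current = [], [records[0]]
    for prev, curr in zip(records, records[1:]):
        if similarity(prev[key], curr[key]) >= threshold:
            current.append(curr)      # stay in the same block
        else:
            blocks.append(current)    # close the current block
            current = [curr]          # start a new block
    blocks.append(current)
    return blocks

rows = [{"token": "124GS"}, {"token": "124GS"}, {"token": "89AB"}]
print(block_records(rows, "token"))
# [[{'token': '124GS'}, {'token': '124GS'}], [{'token': '89AB'}]]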
F. Merge
This step merges the corrected data as a single cluster [10], [11]. The user must
maintain the merged record and the prime representative as a separate file in the
data warehouse; this information helps the user with further changes in the duplicate
elimination process. The merge step is also useful for incremental data cleaning:
when new data enters the data warehouse, incremental data cleaning checks the new
data against the already created LOG file. Hence, this reduces the time for the data
cleaning process.
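As a small sketch of the incremental check against the LOG file (assuming, purely for illustration, that the LOG stores the tokens of already-cleaned records):

def needs_full_cleaning(token, log_tokens):
    # A record whose token already appears in the LOG of cleaned records
    # is routed to duplicate handling instead of being cleaned from scratch.
    return token not in log_tokens

log_tokens = {"124GS", "JKS"}   # tokens already stored in the LOG table
for tok in ["JKS", "77MT"]:
    action = "clean as new record" if needs_full_cleaning(tok, log_tokens) else "check against existing duplicates"
    print(tok, "->", action)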
Experimental Results
An attribute selection algorithm is implemented to select the most important
attributes, which have enough information for identifying duplicate records. The
selected attributes are used for duplicate record detection. The results below are
drawn with different attribute values, numbers of duplicate records detected and
token formation.
[Figure: Evaluation of the attributes by field size, missing values, distinct values and measurement type.]
Attribute Vs Duplicates
The identification of duplicates is mainly based on the selection of attributes and
the selection of window size. In the existing methods, a fixed-size sliding window is
used to minimize the number of comparisons. In this method, a dynamic window size is
used based on the similarities of field values, and the best accuracy of duplicate
detection is obtained using this dynamic method. Fig. 7 shows how the number of
duplicate records detected varies as the sliding window size changes and for the
dynamic window size. The result of duplicate detection varies with the selected
window size and with the dynamic size. To test this phenomenon, results are taken by
varying the attribute values for each execution, setting the window size between 10
and 50 as well as using the dynamic size.
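The dynamic window is described only at this level of detail here. One possible reading, sketched below, is that the comparison window for each sorted record grows only while the neighbouring key values remain similar, instead of always spanning a fixed number of records (the similarity measure, threshold and cap are assumptions):

from difflib import SequenceMatcher

def dynamic_window_pairs(sorted_keys, threshold=0.6, max_window=50):
    # For each record, extend the comparison window over the following
    # sorted records only while the keys stay similar, instead of always
    # comparing a fixed number of neighbours.
    pairs = []
    for i, key in enumerate(sorted_keys):
        for j in range(i + 1, min(i + 1 + max_window, len(sorted_keys))):
            if SequenceMatcher(None, key, sorted_keys[j]).ratio() < threshold:
                break                    # window closes when similarity drops
            pairs.append((i, j))         # candidate duplicate pair
    return pairs

keys = ["124GS", "124GS", "124GT", "89AB"]
print(dynamic_window_pairs(keys))  # [(0, 1), (0, 2), (1, 2)]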
[Figure 7: Number of duplicate records detected for each attribute (ADD_ADDRESS1, ADD_NAME, ADD_PHONE1, ADD_ADDRESS2, ADD_CDATE, ADD_PINCODE) at window sizes 2, 10, 20, 30, 40, 50 and with the dynamic window.]
[Figure: Number of duplicates detected versus the number of key columns; the values are given in the table below.]
No. of Key Columns | Selected Columns                                                                      | No. of Duplicates Detected
1                  | ADD_ADDRESS1                                                                          | 60988
2                  | ADD_ADDRESS1, ADD_NAME                                                                | 58944
3                  | ADD_ADDRESS1, ADD_NAME, ADD_PHONE1                                                    | 50860
4                  | ADD_ADDRESS1, ADD_NAME, ADD_PHONE1, ADD_CDATE                                         | 45128
5                  | ADD_ADDRESS1, ADD_NAME, ADD_PHONE1, ADD_CDATE, ADD_DEL                                | 40156
6                  | ADD_ADDRESS1, ADD_NAME, ADD_PHONE1, ADD_CDATE, ADD_DEL, ADD_PARENTTYPE                | 39160
7                  | ADD_ADDRESS1, ADD_NAME, ADD_PHONE1, ADD_CDATE, ADD_DEL, ADD_PARENTTYPE, ADD_PINCODE   | 33172
Figure 9: Time taken (seconds) for token creation and for table definition & attribute selection with different data sizes.
Fig. 10 shows the variance of time between token-based similarity and
multiple-attribute similarity with different database sizes. The speed of the data
similarity computation is increased with token-based similarity computation. As the
size of the data increases, the time taken for similarity computation also increases;
nevertheless, the time taken is lower for token-based similarity computation than for
multi-attribute similarity computation.
[Figure 10: Time (seconds) for token-based similarity computation versus multi-attribute similarity computation at different database sizes.]
[Figure: Number of duplicates identified per attribute (Address1, Name, Address2, Phone) for database sizes of 100,000 to 500,000 records.]
Duplicates Vs Window
Fig. 12 shows the relationship between the identification of duplicates, the window
size and the attributes. In this figure, the size of each window and the number of
windows vary for each attribute, and the number of identified duplicates also varies
with each attribute. Accurate results are obtained using the attributes address1 and
phone; these two attributes carry enough information, that is, a high number of
distinct values, for duplicate identification. The attributes name and address2 have
fewer unique values than the other attributes, and wrongly identified duplicates are
high with these two attributes. As a result, smaller blocks lead to fewer comparisons
but matching pairs are missed, while pair completeness improves with larger windows.
The figure shows how the identification of duplicates varies with the size of the
window: as the duplicates increase, the size of the window also increases for the
different attributes.
[Figure 12: Duplicates identified versus window size for each attribute (1 – Address1, 2 – Name, 3 – Address2, 4 – Phone).]
The time taken for the duplicate data detection and elimination processes is analyzed
to evaluate the efficiency of the time saved in this research work.
Figure 13: Pair completeness for the blocking keys (array-based block-token-key, blocking with single attribute, blocking with multiple attributes, token-based blocking key) and the dynamic window.
[Figure: Number of duplicate records detected at threshold values from 0.5 to 1.0.]
[Figure: Time (seconds) for table definition, token creation, blocking, similarity computation, duplicate detection, duplicate elimination and the total time, for database sizes of 50,000 to 150,000 records.]
The time taken increases linearly as the size of the database increases and is
independent of the duplicate factor.
[Figure: Total time for database sizes of 50,000 to 150,000 records containing 10%, 30% and 50% duplicates.]
Conclusion
Deduplication and data linkage are important tasks in the pre-processing step of
many data mining projects. It is important to improve data quality before data is
loaded into the data warehouse. Locating approximate duplicates in large databases is
an important part of data management and plays a critical role in the data cleaning
process. In this research work, a framework is designed to clean duplicate data in
order to improve data quality, and it supports any subject-oriented data. This
framework is useful for developing a powerful data cleaning tool that applies the
existing data cleaning techniques in a sequential order.
A new attribute selection algorithm is implemented and evaluated through
extensive experiments. The attribute selection algorithm can eliminate both irrelevant
and redundant attributes, is applicable to any type of data (nominal, numeric, etc.),
and handles data of different attribute types smoothly. The quality of the algorithm's
results is confirmed by applying a set of rules. The main purpose of this attribute
selection for data cleaning is to reduce, in an efficient way, the time for the further
data cleaning processes such as token formation, record similarity and elimination.
The token formation algorithm is used to form smart tokens for data cleaning and
is suitable for numeric, alphanumeric and alphabetic data. Three different rules are
described for the numeric, alphabetic, and alphanumeric tokens. The result of
token-based data cleaning is the removal of duplicate data in an efficient way. The
time is reduced by the selection of attributes and by the token-based approach. The
formed tokens are stored in the LOG table; the time required to compare entire
strings is greater than that required to compare tokens. The formed token is used as
the blocking key in the further data cleaning process, so token formation is very
important for defining the best, smart token.
Using an unsuitable key, one which is not able to group the duplicates together, has
a deterring effect on the result, i.e., many false duplicates are detected in comparison
with the true duplicates when using, say, the address key. Hence, key creation and
the selection of attributes are important in the blocking method for grouping similar
records together. The selection of the most suitable blocking key (parameter) for the
blocking method is addressed in this research work. Dynamically adjusting the
blocking key for the blocking method is effective in record linkage algorithms during
execution. The blocking key is selected based on the type of the data and the usage
of the data in the data warehouse. The dynamically adjusted blocking key and the
token-based blocking key, as well as the dynamic window size SNM method, are used in
this research work. An agent is used for tuning the parameters, so everything is set
dynamically for the blocking method without human intervention to yield better
performance. In most real-world problems, where expert knowledge is hard to obtain,
it is helpful to have methods that can automatically choose reasonable parameters.
Time is critical in cleansing large databases. In this research work, an efficient
token-based blocking method and similarity computation method are used to reduce the
time taken for each comparison, and an efficient duplicate detection and elimination
approach is developed to obtain good detection and elimination results by reducing
false positives. The performance of this research work shows significant time savings
and improved duplicate results compared with the existing approaches.
Compared with previous approaches, the token concept is included in this new
framework to speed up the data cleaning process and reduce its complexity. Each step
of the new framework is specified clearly in a sequential order by means of the six
data cleaning processes offered: attribute selection, token formation, clustering,
similarity computation, elimination, and merge. An agent is used in this framework to
reduce the effort required of the user; this agent works according to the type and
size of the data set. This framework is flexible for all kinds of data in relational
databases.
The framework is developed mainly to increase the speed of the duplicate data
detection and elimination process and to increase the quality of the data by
identifying true duplicates while being strict enough to keep out false positives.
The accuracy and efficiency of the duplicate elimination strategies are improved by
introducing the concept of a certainty factor for a rule. Data cleansing is a complex
and challenging problem; this rule-based strategy helps to manage the complexity, but
does not remove it. The approach can be applied to subject-oriented databases in any
domain. The proposed research work maintains LOG files for all the cleaning processes
to support incremental data cleaning.
References
[1] Dorian Pyle, Data Preparation for Data Mining, Morgan Kaufmann, ISBN 1-55860-529-0, 540 pages, 1999.
[2] Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, Elsevier Science & Technology Books, ISBN 978-1-55860-901-3, March 2006.