
An Evolutionary Hybrid Distance for Duplicate String Matching

David Camacho
Computer Science Department, Universidad Autonoma de Madrid, C/ Francisco Tomas y Valiente, n. 11, 28049 Madrid, Spain
david.camacho@uam.es

Ramon Huerta
Institute for NonLinear Science, University of California, San Diego, 9500 Gilman Drive, La Jolla, CA 92093-0402, USA
rhuerta@ucsd.edu

Charles Elkan
Department of Computer Science and Engineering, University of California, San Diego, 9500 Gilman Drive, La Jolla, CA 92093-0404, USA
elkan@cs.ucsd.edu

ABSTRACT

This paper proposes a new hybrid string distance that combines edit-based and token-based string distances to match duplicate entries in separate databases. The parameters of the hybrid distance are optimized to maximize the number of correct matches by means of a genetic algorithm (GA). The hybrid distance evolved by the GA beats published results on the Zagat and Fodor restaurant data sets; in particular, it beats the support vector machine and Bayes methods. The hybrid string distance is then used to match two financial databases, detecting duplicates by company name and ticker symbol. We show that 250 records in the training set are sufficient to reach the same success rate on the test set as on the training set.
Categories and Subject Descriptors

H.3.2 [Information Storage]: Record classification; H.3.3 [Information Search and Retrieval]: Selection process

Keywords

string distance, string matching, distance metric learning, information integration, genetic algorithms

1. INTRODUCTION

Automatic detection of similar (or duplicate) entities is a critical problem in a wide range of applications, from Web learning to protein sequence identification [8, 15, 24, 29]. The identification of similar entities, or record linkage, is necessary to merge databases with complementary information. There is a need to solve these problems algorithmically given the costs of using human expertise.

The problem of record linkage was defined as the process of matching equivalent records that differ syntactically [24]. Fellegi and Sunter [8] formulated entity matching as a classification problem, where the basic goal is to classify entity pairs as matching or non-matching based on statistical unsupervised methods. Edit distances, such as Levenshtein [18] or Jaro-Winkler [34], have primarily been used to solve this problem in the past.^1 In recent times the problem has received a boost from methods imported from the machine learning field. Recent work [2, 5, 16, 17] proposes important refinements and improvements of the classical Fellegi and Sunter approach. State-of-the-art methods (support vector machines and Bayes classifiers) have been applied very successfully to the problem, while GA performance has not yet been shown. The main goal of this work is to let a genetic algorithm design a new string distance using well-known string matching distances. The proposed distance uses a modification of the Monge-Elkan method [20, 21] and several statistical context-based parameters [5, 34], such as the number of tokens in the string, the relative token position in the string, and the frequency of the tokens in a corpus. The evolutionary strategy adapts the combination of distances and parameters that maximizes the success of the matching process. Success is measured using precision and recall [1], where precision is the fraction of retrieved records that are really duplicates of the selected record, and recall is the fraction of all true duplicates that are retrieved. The GA requires labeled data sets where supervised learning can be applied. We deal with two data sets. One is the Zagat and Fodor restaurant data set, where multiple methods have been applied; it is a good benchmark to determine whether it is possible to improve known performances. The second is a financial data set: the Center for Research in Security Prices (CRSP) [6, 7] and the COMPUSTAT databases [25], which are linked by a unique number. This link is continuously maintained and updated because it is of critical importance to keep these databases as clean as possible for financial applications. The main advantage of this data set for us is its size: there are more than 20K entries, from which multiple training and testing sets can be built.

The rest of the paper is structured as follows. Section 2 describes some string matching and statistical-based distances. Section 3 describes the evolutionary algorithm and the string distance to be evolved. Section 4 describes the cost function used by the genetic algorithm. Section 5 presents the databases used both for training the genetic algorithm and for evaluating the hybrid distance, and provides the experimental comparison (using precision and recall measures) of our approach against some well-known methods. Finally, Section 6 gives the conclusions and future lines of this work.

^1 An edit distance is a metric used to determine how close two strings are. Its value is the number of character insertions and deletions needed to convert one string into the other.

2. RELATED WORK

Several general methods have been proposed to solve the problem of name matching; they have been applied to areas like text and information retrieval [22, 33, 35], record linkage [8], and information integration [21, 30]. Some well-known techniques have been proposed by Levenshtein [18], Jaro [12], Needleman-Wunsch [23], Smith-Waterman [28], Monge-Elkan [20], and Winkler [34]. This section describes only the ones related to edit-based and statistical-based distances.

2.1 String matching methods

There is a large number of methods based on the concept of edit distance metrics. Edit distance is defined for strings of arbitrary length and counts differences between strings in terms of the number of character insertions and deletions needed to convert one into the other. A number of these methods use the edit distance at character level to deal with variants of names: if the distance between two names is less than a certain threshold, both are considered aliases of the same entity. These methods are based on the idea that the same concepts are likely to be modeled using quite similar names. Examples of such metrics are Levenshtein [18]; the Jaro metric [12, 13] and its variation by Winkler [34]; the Needleman-Wunsch distance [23], which assigns different costs to the edit operations; Smith-Waterman [28], which additionally uses an alphabet mapping to costs; and Monge-Elkan [20, 21], which uses variable costs depending on the substring gaps between the words. The latter can be considered a hybrid distance because it can accommodate multiple static string distances using several parameters.

These string matching distances, together with token-based distances (see next section), have been used by various artificial intelligence methods to design adaptable metrics. For instance, the MARLIN (Multiply Adaptive Record Linkage with INduction) system [2, 3] employs a two-level learning approach. In the first stage a set of similarity metrics is trained to appropriately determine the similarity of different field values. Later, a final predicate for detecting duplicate records is learned from the similarity metrics applied to each of the individual fields. The MARLIN approach is based on a model for computing the string distance with affine gaps, which is applied via SVM to compute the vector-space similarity between strings. Another system, called Active Atlas [30, 31], uses a method for learning the weights of different string transformations; the identification of matching objects is then achieved by computing the similarity score between the attributes of the objects.

2.2 Statistical (token-based) distances

Other similarity metrics, not based on an edit-distance metric, use an underlying statistical model based on the number and order of the tokens common to two strings; these metrics count the common characters between two strings even if they are misplaced by a short distance. Token-based distances, such as Jaccard similarity, cosine distance (with TF-IDF), and Soft TF-IDF [4, 5], divide the two strings to be compared, s and t, into sets of words (or tokens), and then apply some similarity metric over these sets.

The TF-IDF (Term Frequency-Inverse Document Frequency) weight is a statistical measure used to evaluate how important a word is to a document in a collection or corpus. The TF-IDF weighting method is often used in the Vector Space Model together with cosine similarity to determine the similarity between two documents (this measure is widely used in both the Information Retrieval and Text Mining communities). Therefore, TF-IDF can be calculated as a function of the frequency of a particular token (w) in the set (S).

Another interesting approach is Soft TF-IDF [4, 5], which combines string-based and token-based distances. In this approach similarity is affected both by the tokens that appear in sets S and T and by those tokens in S that are similar to tokens in T. Therefore a secondary similarity distance sim' (e.g., Jaro-Winkler) is needed to determine the set of tokens to be evaluated. This approach defines a special set CLOSE(θ, S, T) as the set of words w ∈ S such that there exists some word v ∈ T with sim'(w, v) > θ. The set CLOSE thus allows a token-based distance and the statistics of a particular corpus to be integrated in the similarity evaluation of a particular word.
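The CLOSE set above can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: difflib's SequenceMatcher ratio stands in for the secondary similarity sim' (the paper uses Jaro-Winkler), and the word sets are toy examples.

```python
from difflib import SequenceMatcher

def sim(a: str, b: str) -> float:
    # Stand-in similarity in [0, 1]; the paper uses Jaro-Winkler here.
    return SequenceMatcher(None, a, b).ratio()

def close(theta: float, S: set, T: set) -> set:
    # CLOSE(theta, S, T): words w in S for which some v in T has sim(w, v) > theta.
    return {w for w in S if any(sim(w, v) > theta for v in T)}

S = {"microsoft", "corporation"}
T = {"microsft", "corp"}
print(close(0.8, S, T))  # "microsoft" is close to the misspelled "microsft"
```

Only the words in CLOSE then contribute their corpus statistics to the Soft TF-IDF score, which is how the token-level statistics and the character-level similarity are combined.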

3. HYBRID STRING DISTANCE AND SUPERVISED STATISTICAL-BASED ALGORITHM

Evolutionary Computation (EC) is a very active research field in artificial intelligence; the techniques proposed by EC are based, to some degree, on the evolution of biological life in the natural world. Genetic Algorithms [11], Evolution Strategies [27], Evolutionary Programming [9], and Genetic Programming [14] are some of the methodologies developed in the last few decades. For our specific goal, Genetic Algorithms (GA) are suitable because of their simple parameter coding and cost function design. The algorithm has two levels. The first level determines the selection and combination of string matching and token-based distances that extract the necessary information, which is fed into the second level. The databases usually have several fields for every record; all the distances across the fields can be weighted to determine whether they are above or below a threshold. All the parameters involved are optimized by the GA.^2
^2 The GA was implemented using PGAPACK [19], http://www-fp.mcs.anl.gov/CCST/research/reports_pre1998/comp_bio/stalk/pgapack.html

We build the distance at two levels. Level 0 builds the evolved distance on a single field of a record in the database. Level 1 combines the output distance values from each of the fields to form a meta-distance.
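The optimization loop the GA performs over these parameters can be sketched as below. This is a minimal mutation-only GA with truncation selection, not the PGAPACK setup used by the authors; the fitness function is a stub standing in for the F-measure computed over a labeled training set, and the tuple (alpha, beta, dtype, Theta) is a simplified chromosome (the field weights lambda_m are omitted).

```python
import random

def fitness(params):
    # Stub fitness: in the paper this would be the F-measure of the evolved
    # distance on labeled training pairs. Here we just prefer Theta near 0.5
    # so the sketch is runnable end to end.
    alpha, beta, dtype, theta = params
    return -abs(theta - 0.5)

def mutate(params, scale=0.1):
    # Gaussian perturbation of the real parameters; dtype (the index of the
    # base string distance, 0..4) mutates by resampling with low probability.
    alpha, beta, dtype, theta = params
    return (alpha + random.gauss(0, scale),
            beta + random.gauss(0, scale),
            random.randrange(5) if random.random() < 0.1 else dtype,
            min(1.0, max(0.0, theta + random.gauss(0, scale))))

def evolve(pop_size=20, generations=50):
    random.seed(0)
    pop = [(random.random(), random.random(), random.randrange(5), random.random())
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        survivors = pop[:pop_size // 2]          # truncation selection (elitist)
        pop = survivors + [mutate(random.choice(survivors)) for _ in survivors]
    return max(pop, key=fitness)

best = evolve()
```

Because survivors are carried over unchanged, the best chromosome never degrades across generations; swapping the stub fitness for the real F-measure recovers the paper's training loop.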

3.1 Level 0: string distance function

We implemented a general distance that smoothly integrates the string matching methods with the statistical ones. Many variations and improvements of this distance could be designed, as long as a computationally feasible cost function is evaluated. A given field of a database, x_i, can be split into a set of words w_j^i, j = 1, ..., N_i. For example, x_i = "microsoft corporation" is split into w_1 = "microsoft", w_2 = "corporation", with N_i = 2. Thus we create a basket of words over which the edit-based distances can be applied. This can be expressed mathematically as

\delta_{v0}(x_i, x_j) = \sum_{k=1}^{N} d_{type}(w_k^i, w_k^j),

where N = \min\{N_i, N_j\} and the variable type can be the Levenshtein [18], Jaro [12], Needleman-Wunsch [23], Smith-Waterman [28], or Winkler [34] distance. Next, we integrate the Monge-Elkan concept: to make sure that there is no mismatch due to bad ordering of the sequence of words, we apply a permutation operator P_l, where l denotes the permutation number. The permutation set can be built with any desired restrictions; for instance, permutations of the nearest words are a reasonable approach. The permutation operator induces a reordering of the words in x_i. In our example, w_{P_1(1)} = "microsoft", w_{P_1(2)} = "corporation"; and w_{P_2(1)} = "corporation", w_{P_2(2)} = "microsoft". Thus we can generalize the distance as

\delta_{v1}(x_i, x_j) = \min_l \left\{ \sum_{k=1}^{N} d_{type}\left(w_k^i, w_{P_l(k)}^j\right) \right\}, \qquad (1)

where l runs through the whole set of permutations that we want to introduce. Finally, we include the concept behind the statistical distances [5] to build the final expression of the distance,

\delta(x_i, x_j) = \min_l \left\{ \sum_{k=1}^{N} d_{type}\left(w_k^i, w_{P_l(k)}^j\right) \frac{\log\left(\alpha + \beta / f(w_k^i)\right) + \log\left(\alpha + \beta / f(w_{P_l(k)}^j)\right)}{2 \log(\alpha + \beta)} \right\}, \qquad (2)

where f(w_k^i) is the frequency of the word in the corpus, which carries the statistics, and \alpha and \beta are scaling parameters that allow the GA to adapt the value of the score function to the context statistics. For instance, with \beta = 0 the statistical factor equals one for every word, no statistical information is used, and the distance reduces to a static string distance.

Table 1: Summary of the hybrid string distance and meta-distance parameters.

Parameter              Distance
\alpha                 \delta(x_i, x_j)
\beta                  \delta(x_i, x_j)
d_{type} \in [0..4]    \delta(x_i, x_j)
\lambda_m              \Delta-distance
\Theta                 \Delta-distance

Any record in a database has several fields, so a method is required that accounts for this. Therefore, we build a meta-distance on top of the Level 0 distance.

3.2 Level 1: weighted average over all the fields

In a general problem (using two different databases and comparing a particular string that identifies the record), every entry in the database may have several fields. For instance, in the COMPUSTAT/CRSP merged database^3 we have a ticker and a company name. In the Restaurant database [4, 16] we find the name of the restaurant, the address, the city, the telephone number, and the state. Thus a given entry F_i is made of NF fields, (x_i(0), x_i(1), ..., x_i(NF)), and the final weighted meta-distance can be written as

\Delta(F_i, F_j) = \frac{1}{\sum_{m=1}^{NF} \lambda_m} \sum_{m=1}^{NF} \lambda_m \, \delta(x_i(m), x_j(m)),

where the GA assigns a weight \lambda_m to every field in a particular record. For instance, in the Restaurant database higher weights could be assigned to the first field (name of the restaurant) and lower ones to other fields such as address or state, if this distribution of weights allows better detection of duplicates.

3.3 Definition of match

Using our Level 0 and Level 1 distances, we classify two entries as a match if

\Delta(F_i, F_j) < \Theta.

Otherwise they are different. In summary, the following parameters are optimized: (\alpha, \beta), which convey the statistics of the data set, and d_{type}, an integer-valued parameter that selects a particular string distance, together with the field weights \lambda_m and the threshold \Theta. Table 1 summarizes the parameters used and their relationship with the mathematical expressions.

4. COST FUNCTION
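The Level 0 and Level 1 distances can be sketched as follows. This is an illustrative sketch, not the authors' code: a normalized Levenshtein distance (with substitutions) stands in for the selectable d_type, the minimum over word permutations follows Eq. (1), and the optional corpus-frequency weighting is our assumed reading of the statistical factor in Eq. (2). The exhaustive permutation loop is only practical for short fields.

```python
from itertools import permutations
from math import log

def lev(a: str, b: str) -> float:
    # Standard Levenshtein distance (insert/delete/substitute), normalized to [0, 1].
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1): d[i][0] = i
    for j in range(n + 1): d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i-1][j] + 1, d[i][j-1] + 1,
                          d[i-1][j-1] + (a[i-1] != b[j-1]))
    return d[m][n] / max(m, n, 1)

def delta(x_i: str, x_j: str, freq=None, alpha=1.0, beta=0.0) -> float:
    # Level 0: minimum over word permutations of the summed per-word distances,
    # optionally weighted by corpus statistics (beta = 0 disables the weighting;
    # freq is assumed to map every word to its corpus frequency).
    wi, wj = x_i.split(), x_j.split()
    n = min(len(wi), len(wj))
    best = float("inf")
    for p in permutations(range(len(wj))):
        total = 0.0
        for k in range(n):
            w = 1.0
            if beta > 0 and freq:
                w = (log(alpha + beta / freq[wi[k]]) +
                     log(alpha + beta / freq[wj[p[k]]])) / (2 * log(alpha + beta))
            total += lev(wi[k], wj[p[k]]) * w
        best = min(best, total)
    return best

def meta(fields_i, fields_j, lambdas):
    # Level 1: weighted average of the per-field Level 0 distances.
    s = sum(lambdas)
    return sum(l * delta(a, b) for l, a, b in zip(lambdas, fields_i, fields_j)) / s
```

For example, delta("microsoft corporation", "corporation microsoft") is 0, because the permutation that swaps the two words aligns identical tokens, which is exactly the word-order robustness the Monge-Elkan permutation term provides.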

For each pair of entities (strings) the distance is calculated. Using this distance we determine whether it is below the threshold, \Theta, to count it as a positive match. From all these we determine how many false positives (fp, non-duplicate entities that were wrongly classified) and true positives (tp, true duplicate entities) have been retrieved. The global value to be maximized is the F-measure, that is, the harmonic mean of the classical Information Retrieval precision (P) and recall (R) values [1, 33].
^3 http://www.crsp.com/products/ccm.htm

P(w_i, w_j, \Theta) = \frac{tp}{tp + fp}, \qquad R(w_i, w_j, \Theta) = \frac{tp}{N_{total}}, \qquad F_{measure}(w_i, w_j, \Theta) = \frac{2PR}{P + R},

where N_{total} is the total number of existing relevant records. The precision, P(w_i, w_j, \Theta), is the fraction of the retrieved records that are correct, whereas the recall, R(w_i, w_j, \Theta), is the fraction of all the relevant records that are retrieved. Perfect retrieval gives F_{measure}(w_i, w_j, \Theta) = 1, and total failure leads to F_{measure} = 0.

Table 2: Performance comparison of different methods using the Zagat and Fodor restaurant data sets. (*) Results obtained from [16].

Distance                      Training set size    F-measure
Jaro                          331                  0.71616
Needleman-Wunsch              331                  0.712
Levenshtein                   331                  0.740
HGM (SoftTFIDF) (*)           331                  0.844
Jaro-Winkler                  331                  0.921
Bayes independent (*)         331                  0.951
Bayes dependent (*)           331                  0.955
Support Vector Machines (*)   331                  0.971
\Delta-distance               331                  0.982
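The cost function can be sketched as follows; a minimal sketch, assuming a pair counts as retrieved when its meta-distance falls below Theta, with N_total the number of true duplicate pairs.

```python
def f_measure(scores, labels, theta, n_total):
    # scores: meta-distance per candidate pair; labels: True if the pair is a
    # real duplicate. A pair is retrieved as a match when its score < theta.
    retrieved = [label for score, label in zip(scores, labels) if score < theta]
    tp = sum(retrieved)            # retrieved pairs that are real duplicates
    fp = len(retrieved) - tp       # retrieved pairs that are not
    if tp == 0:
        return 0.0
    p = tp / (tp + fp)
    r = tp / n_total
    return 2 * p * r / (p + r)

# Toy example: 3 real duplicates among 5 candidate pairs, threshold 0.5.
scores = [0.1, 0.2, 0.6, 0.3, 0.9]
labels = [True, True, False, False, True]
# Retrieved pairs (score < 0.5): True, True, False -> tp = 2, fp = 1,
# so P = 2/3, R = 2/3, and F = 2/3.
```

The GA maximizes this value over the parameter vector, with the threshold Theta itself being one of the evolved parameters.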

5. DATA SETS AND EXPERIMENTAL EVALUATION

In the name matching problem several data domains are usually managed, such as computational biology (DNA patterns, protein identification, ...), text retrieval, finance [35], and ontology alignment [10, 32]. Designing a supervised learning algorithm requires databases previously evaluated by human intervention. The first data set considered is the Zagat and Fodor restaurant guide, because it is widely used in the Information Retrieval (IR) and Information Integration (II) literature [2, 4, 16, 26]. The second data set is a financial data set (the CRSP/COMPUSTAT merged database) that is proposed here for the first time and is available for public testing at inls.ucsd.edu/~huerta.

5.1 Results on the Restaurant Database

The restaurant data set from RIDDLE (which includes integrated and cleaned information from the Zagat and Fodor Web sites) was divided into two databases (Zagat and Fodor). The data set contains 864 restaurant names, addresses, and telephone numbers; 112 of those records are duplicates (with 533 and 331 non-duplicate restaurants in the Fodor and Zagat guides, respectively). Every record has five fields of the following form (RIDDLE, Restaurant data set):

@data
"arnie mortons of chicago", "435 s. la cienega blv.", "los angeles", "310/246-1501", "american", 0
"arnie mortons of chicago", "435 s. la cienega blvd.", "los angeles", "310-246-1501", "steakhouses", 0

Each of these records is processed into a set of single strings with five fields, that is, NF = 5 and [x_i(1), x_i(2), x_i(3), x_i(4), x_i(5)] = [restaurant name, address, city, telephone, type]. The results shown in Table 2 comprise information extracted from [16], labeled with asterisks, and our proposed \Delta-distance evolved with the GA. The results from [16] were obtained by Lehti & Fankhauser using the same training set (Restaurants) to test several unsupervised (HGM) and supervised (Bayes, SVM) algorithms. The GA adjustment of the \Delta-distance gives F = 0.982, which is better than all results published to date. The main reason we obtain better results probably lies in the flexibility of the \Delta-distance and the optimization procedure. The question that remains is how much overfitting is taking place. In all the previous work this issue was not addressed, due to the limited sample size. One of the best ways to test the level of overfitting with this method is to use the financial databases, which are large and can be trained by supervised learning. This is what we address in the next section.

5.2 Results on the Financial Databases

The financial data domain is a good testbed because of the large number of entries in the databases, and the number of available databases is increasing with time due to the advent of financial web services. We focus on the Center for Research in Security Prices (CRSP) [6, 7] and the COMPUSTAT databases [25], the two most popular databases in the financial field. CRSP contains historical price data for individual stocks, while COMPUSTAT provides the fundamentals of a company. These two databases are merged into a single one called the CRSP/COMPUSTAT database. The main advantage of having CRSP and CRSP/COMPUSTAT is that the two databases are linked by common tags called permanent numbers. The permanent numbers allow us to build training and testing sets of paired entries. In this particular data set we used NF = 2: the first field is the ticker, for example IBM, AAPL, GOOG; the second field is the company name, for example International Business Machines, Apple Computers, Google. The overall results are shown in Table 3. Seven different training groups of duplicates were built, and 2000 non-overlapping examples were used for testing.^4 The most important conclusion is the ability to train the distance with a relatively low number of data entries without overfitting. In fact, 250 entries are enough to obtain good performance on the testing set. The solution for the 1000-entry case is \alpha = 63.136, \beta = 300.821, \Theta = 1,710,560, type = Jaro distance, \lambda_1 = 0.088, and \lambda_2 = 0.855; the F-score is 0.85. It is interesting that the ticker symbol (\lambda_1 = 0.088) has a substantially lower weight than the company name, given that tickers are the common way to refer to stocks. The fact that the system achieves good performance with few examples lets us speculate that, given the nature of the string distance problem, the parameters of the distances are not very susceptible to overfitting.
^4 The data sets are available at inls.ucsd.edu/~huerta.

Table 3: Results using the \Delta-distance for different numbers of training elements, using the CRSP/COMPUSTAT merged database. The data set is available at inls.ucsd.edu/~huerta.

# records in training    F-Training    F-Testing
10                       1             0.769
25                       0.896         0.780
50                       0.896         0.829
100                      0.876         0.822
250                      0.857         0.855
500                      0.867         0.849
1000                     0.855615      0.849

Even for the 10-record training set the test results are acceptable, with an F-score of 0.769, type = Jaro, \lambda_1 = 0.156, and \lambda_2 = 0.722, which are not far from the solution using 1000 training examples.

6. CONCLUSIONS

We propose a flexible string distance that combines and selects from a pool of distances by means of a GA. The new hybrid distance has been tested on two data sets. The first is based on the Zagat and Fodor restaurant guides and has been used for comparison with classical algorithms such as Bayes or SVM classifiers; for this case, the hybrid distance obtains the best accuracy. The advantage might lie in the flexibility of our \Delta-distance adapted by the GA. The GA itself is just a tool; we would expect that, for instance, a simulated annealing technique could achieve the same results. We also propose a financial data domain as a new testbed for state-of-the-art methods; at this point we leave the comparison with other methods to future work and other researchers. For the CRSP/COMPUSTAT test our hybrid distance achieves an accuracy level of 0.85. The database itself contains some inaccuracies that can impede not only reaching perfect accuracy but the learning process itself. It is remarkable that 250 entries in the training set already achieve performance as good as the \Delta-distance trained with 1000 examples. This is a promising result in favor of unsupervised algorithms, because it indicates that there is good underlying statistical structure provided by basic distances like the Jaro distance, which was the preferred choice of the GA.

7. ACKNOWLEDGMENTS

This work has been funded by a Jose Castillejo grant (JC2007-00063) from the Spanish Ministry of Science.

8. REFERENCES

[1] R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. Addison Wesley, 1999.
[2] M. Bilenko and R. Mooney. Learning to combine trained distance metrics for duplicate detection in databases. Technical report, Artificial Intelligence Lab, University of Texas at Austin, 2002.
[3] M. Bilenko and R. Mooney. Adaptive duplicate detection using learnable string similarity measures. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD), pages 39-48, 2003.
[4] M. Bilenko, R. Mooney, W. Cohen, P. Ravikumar, and S. Fienberg. Adaptive name matching in information integration. IEEE Intelligent Systems, 18(5):16-23, Sep/Oct 2003.
[5] W. Cohen, P. Ravikumar, and S. Fienberg. A comparison of string distance metrics for name-matching tasks. In Proceedings of the International Joint Conference on Artificial Intelligence, pages 73-78, 2003.
[6] D. P. Cram. CRSP data retrieval and analysis in SAS software for users and administrators. Technical Report 80, Stanford University Graduate School of Business, 1996.
[7] D. P. Cram. CRSP data retrieval and analysis in SAS software: Sample programs and programming tips. In Proceedings of the Twenty-First Annual SAS Users Group International Conference, 1996.
[8] I. P. Fellegi and A. B. Sunter. A theory for record linkage. Journal of the American Statistical Association, 64(328):1183-1210, 1969.
[9] L. Fogel, A. Owens, and M. Walsh. Artificial Intelligence through Simulated Evolution. John Wiley, 1966.
[10] B. B. Hariri and H. Abolhassani. A new evaluation method for ontology alignment measures. In The Semantic Web - ASWC 2006, Lecture Notes in Computer Science, pages 249-255. Springer Verlag, 2006.
[11] J. H. Holland. Adaptation in Natural and Artificial Systems. University of Michigan Press, Ann Arbor, 1975.
[12] M. Jaro. Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. Journal of the American Statistical Association, 84(406):414-420, June 1989.
[13] M. Jaro. Probabilistic linkage of large public health data files. Statistics in Medicine, 14(5-7):491-498, March-April 1995.
[14] J. Koza. Genetic Programming: On the Programming of Computers by Means of Natural Selection. MIT Press, 1992.
[15] A. J. Lait and B. Randell. An assessment of name matching algorithms. Technical report, Department of Computer Science, University of Newcastle upon Tyne, UK, 1993.
[16] P. Lehti and P. Fankhauser. Probabilistic iterative duplicate detection. In On the Move to Meaningful Internet Systems 2005: CoopIS, DOA, and ODBASE, volume 3761 of Lecture Notes in Computer Science, pages 1225-1242, 2005.
[17] P. Lehti and P. Fankhauser. Unsupervised duplicate detection using sample non-duplicates. Journal on Data Semantics, VII:136-164, 2006.
[18] V. Levenshtein. Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady, 10:707-710, 1966.
[19] D. Levine. PGAPack parallel genetic algorithm library. Technical report, Mathematics and Computer Science Division, Argonne National Laboratory, 2000.
[20] A. Monge and C. Elkan. The field-matching problem: Algorithm and applications. In Proceedings of the 2nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 267-270. AAAI Press, 1996.
[21] A. Monge and C. Elkan. An efficient domain-independent algorithm for detecting approximately duplicate database records. In Proceedings of the SIGMOD Workshop on Data Mining and Knowledge Discovery, pages 267-270. ACM, 1997.
[22] G. Navarro. A guided tour to approximate string matching. ACM Computing Surveys, 33(1):31-88, 2001.
[23] S. Needleman and C. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48:443-458, 1970.
[24] H. Newcombe, J. Kennedy, S. J. Axford, and A. P. James. Automatic linkage of vital records. Science, 130:954-959, October 1959.
[25] P. J. Ratnaraj and C. M. Katzman. Managing large financial data with ease: CRSP, COMPUSTAT, etc., with SAS. In Proceedings of the Nineteenth Annual SAS Users Group International Conference, 1994.
[26] P. Ravikumar and W. Cohen. A hierarchical graphical model for record linkage. In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence (UAI 2004), pages 454-461. UAI Press, 2004.
[27] I. Rechenberg. Evolutionsstrategie: Optimierung technischer Systeme nach Prinzipien der biologischen Evolution. PhD thesis, Berlin Technical University, 1973.
[28] T. Smith and M. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147:195-197, 1981.
[29] C. Snae. A comparison and analysis of name matching algorithms. In World Academy of Science, Engineering and Technology, volume 21, pages 252-257, 2007.
[30] S. Tejada, C. A. Knoblock, and S. Minton. Learning object identification rules for information integration. Information Systems, 28(8):607-635, 2001.
[31] S. Tejada, C. A. Knoblock, and S. Minton. Learning domain-independent string transformation weights for high accuracy object identification. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 350-359, 2002.
[32] A. G. Valarakos, G. Paliouras, V. Karkaletsis, and G. Vouros. A name-matching algorithm for supporting ontology enrichment. In Methods and Applications of Artificial Intelligence, Lecture Notes in Computer Science, pages 381-389, 2004.
[33] C. J. van Rijsbergen. Information Retrieval. London: Butterworths, 1979.
[34] W. Winkler. The state of record linkage and current research problems. Statistics of Income Division, Internal Revenue Service, 1999. Publication R99/04.
[35] F. Xu, D. Kurz, J. Piskorski, and S. Schmeier. Term extraction and mining of term relations from unrestricted texts in the financial domain. In W. Abramowicz, editor, Business Information Systems (BIS), 2002.
