
A Supervised Machine Learning Approach for Gazetteer Record Linkage

ABSTRACT

This paper presents an approach for detecting duplicate records in the context of digital gazetteers, using state-of-the-art machine learning techniques. It reports a thorough evaluation of alternative machine learning approaches designed for the task of classifying pairs of gazetteer records as either duplicates or not, built by using support vector machines or alternating decision trees with different combinations of similarity scores for the feature vectors. Experimental results show that using feature vectors that combine multiple similarity scores, derived from place names, semantic relationships, place types and geospatial footprints, leads to an increase in accuracy. The paper also discusses how the proposed approach can scale to large collections, through the usage of filtering techniques.

Categories and Subject Descriptors
H.3.1 [Content Analysis and Indexing]: Abstracting methods; H.4.m [Information Systems]: Miscellaneous

General Terms
Algorithms, Experimentation

Keywords
Record Linkage, Machine Learning, Gazetteers

1. INTRODUCTION
Digital gazetteers are geospatial dictionaries of named and typed places that exist on the surface of the Earth [10]. Their essential utility is to translate between formal and informal systems of place referencing, i.e. between the ad hoc names and qualitative type classifications assigned to places for human consumption, on the one hand, and the quantitative locations (e.g., geospatial coordinates) that support the automated processing of place references, on the other [11].

Frequently, digital gazetteers are built from the consolidation of multiple data sources. Thus, a fundamental challenge with digital gazetteers is record linkage, i.e. detecting exact and near duplicates so that place identity is preserved. Although record linkage techniques have been extensively studied [22], their application to complex data records like those of gazetteers (i.e., containing not only textual attributes but also geospatial footprints, name and category multi-sets, and hierarchically organized categorical information) has received less attention in the literature.

By formulating record linkage as a classification problem, where the goal is to classify record pairs as duplicates or non-duplicates, this paper argues that the challenge of gazetteer record linkage can be met by using machine learning, more specifically through supervised classification and feature vectors combining multiple sources of similarity (e.g. similarity between geospatial footprints, type categories, semantic relations and place names). The paper reports a thorough evaluation of several machine learning approaches designed for the task, from which the following main conclusions can be reached:

• Both support vector machines (SVMs) and alternating decision tree classifiers are adequate for the task, with decision trees performing slightly better.

• The combination of different types of similarity features leads to an increase in accuracy, although the similarity between place names alone provides a competitive baseline.

• Similarity scores between place names are the most informative features for discriminating between duplicate and non-duplicate record pairs.

The rest of the paper is organized as follows: Section 2 presents related work, describing digital gazetteers and approaches for record linkage. Section 3 describes the proposed approach based on supervised classification, detailing the classification techniques and the proposed similarity features. Section 4 presents the experimental validation of the proposed approach. Finally, Section 5 presents the main conclusions and directions for future work.

2. RELATED WORK
On the one hand, the task of record linkage, aka duplicate detection, has been explored by a number of communities, including statistics, databases, and artificial intelligence. In this paper, the term record linkage is used to refer to the problem of detecting duplicate records when integrating data that comes from multiple databases. On the other hand, the area of Geographic Information Systems, including the sub-area of Geographic Information Retrieval (GIR), has addressed issues related to the management of gazetteer information. This section surveys relevant past research, both in terms of gazetteer data management and in terms of database record linkage.

2.1 Digital Gazetteers
Gazetteer data exists in many independent and often dissimilar sources. Examples include the gazetteers of official toponymic authorities or the place identifier tables accompanying geographic information datasets. Beginning in the 1990s, gazetteers morphed from atlases and monographs into digital databases, accessed via computer systems and later via Web services. In modern digital gazetteers, descriptive entries (i.e., the data records) comprise metadata such as the multiple place names, place types, and geospatial footprints corresponding to a single place, to which a unique identifier is also assigned. Since the data involves hierarchies (e.g., hierarchical organizations for the place type categories) and indefinitely repeating groups (e.g., multiple place names or geospatial footprints), gazetteers are a natural fit for XML encoding technologies. Increasingly, digital gazetteers are used in conjunction with digital libraries and geographic information retrieval (GIR) systems, providing the means for translating ambiguous place names into unambiguous geospatial coordinates [10, 11].

Two of the most relevant past initiatives regarding digital gazetteers are (i) the Open Geospatial Consortium's (OGC) gazetteer service interface, and (ii) the gazetteer service developed in the Alexandria Digital Library project.

The Open Geospatial Consortium (OGC) defined a gazetteer service interface, i.e. WFS-G, based on a re-factored ISO 19112 content model (i.e., spatial referencing by geographic identifiers) published through a Web Feature Service (WFS¹). The service metadata, operations, and types of geographic entities are standardized in this specification, which in turn builds on the Geography Markup Language (GML²) to encode the metadata associated with the places. In particular, WFS-G uses two specific GML feature types, namely SI_Gazetteer and SI_LocationInstance, to encode gazetteer information. Work within the OGC regarding gazetteer services was heavily influenced by the previous work within the Alexandria Digital Library project.

¹ www.opengeospatial.org/standards/wfs
² www.opengeospatial.org/standards/gml

The Alexandria Digital Library (ADL) was one of the pioneering efforts addressing the development of gazetteer service protocols and data models, mainly to support information retrieval over distributed resources [10]. The ADL gazetteer content standard, i.e. a generic data model for gazetteers, defines the core elements of named places (and their history), their spatial location (in various representations), their relationships to other places (e.g., part-of relations), their classification (using a referenced typing scheme), and other metadata properties (e.g. source attribution). Regarding place type classification, the ADL project defined an extensive taxonomy of place types, known as the Feature Type Thesaurus (FTT). The current version of the FTT has a total of 1156 place type classes (i.e., terms), of which 210 are preferred terms and 946 are non-preferred terms that are related to the preferred terms. A prototype system describing approximately 5 million U.S. domestic and international places, using the content standard and the FTT, has been deployed by combining the U.S. place names from the Geographic Names Information System (GNIS) and the non-U.S. place names from the GEOnet Names Server (GNS) gazetteer of the National Geospatial-Intelligence Agency (NGA), as well as from other sources.

Within the ADL gazetteer initiative, XML schemas have been proposed for encoding gazetteer data according to the content standard, reusing the geometry schema of the Geography Markup Language (GML) for encoding geospatial footprints. Web service interfaces for accessing gazetteer data have also been proposed. The datasets used in the experiments reported in this work are all encoded according to the XML schema of the ADL gazetteer protocol, a lightweight version of the ADL gazetteer content standard.

2.2 Database Record Linkage
The problem of identifying database records that are syntactically different and yet describe the same physical entity has been referred to as duplicate detection [13], identity uncertainty [14], object identification [18], merge/purge processing [17], de-duplication [15], and record linkage [22]. This paper adopts the term record linkage, as the main motivation behind this work is the consolidation of gazetteer records coming from multiple sources. Typical methods involve the computation of a similarity score between pairs of records, under the assumption that highly similar records are likely to be duplicates [36]. For every candidate pair of records, a similarity is computed using some distance metric(s) or a probabilistic method. Candidate pairs that have similarity scores higher than a given threshold value can then be linked. The transitive closure of the linked pairs forms the final equivalence classes of duplicate records.

Concerning similarity measures, past record linkage research has focused on distance metrics computed over strings. Commonly used metrics include the Levenshtein distance [35] (derived from the minimum number of character deletions, insertions or substitutions required to equate two strings), the Monge-Elkan distance [13] (similar to the Levenshtein distance, but assigning a relatively lower cost to a sequence of insertions or deletions) and the Jaro-Winkler metric [22] (a fast heuristic metric, specific to the comparison of proper names, which is based on the number and order of the common characters between two strings and also accounts for common prefixes). Cohen et al. compared different string similarity metrics, concluding that the Jaro-Winkler metric works almost as well as the Monge-Elkan scheme and is an order of magnitude faster [23].
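As an illustration of the edit-distance family of metrics discussed above, a minimal Levenshtein implementation can be sketched as follows; the length-based normalization into a [0,1] similarity score is a common convention, not part of the metric itself.

```python
def levenshtein(a: str, b: str) -> int:
    # Dynamic programming over prefixes: after processing row i,
    # prev[j] holds the edit distance between a[:i] and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[len(b)]

def name_similarity(a: str, b: str) -> float:
    # Normalize the distance by the longer string, yielding a
    # similarity of 1.0 for identical names and 0.0 for disjoint ones.
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))
```

For instance, levenshtein("Lisboa", "Lisbon") is 1 (a single substitution), while levenshtein("Wien", "Vienna") is 3, illustrating why purely syntactic metrics struggle with the cross-language spelling variants that are common among place names.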
While the approaches above work well for syntactic matches, record linkage should also account for phonetic similarity. The double metaphone algorithm, popularized by the ASpell spelling corrector, compares words on a phonetic basis according to their pronunciation, returning a single key value for similarly sounding words [24]. Words with the same double metaphone key can be said to have a similarity of 1.0, and a similarity of 0.0 can be assigned in the other cases.

Besides string similarity metrics, past research has also addressed the computation of similarity scores between other types of objects. Lin, Resnik and others have suggested semantic similarity measures for categorical information, based on having objects labeled with terms from a hierarchical taxonomy [25, 26, 27]. Concerning categorical information based on multi-sets of objects, the Jaccard coefficient measures similarity as the size of the intersection divided by the size of the union of the sample sets. Dice's coefficient also measures similarity between multi-sets of objects, and is defined as two times the size of the intersection divided by the sum of the sizes of the sample sets. Both the Jaccard and the Dice coefficients can also be applied as string similarity metrics, by seeing the strings as sets of characters or even as sets of word tokens [23]. Software frameworks such as SimPack [8] implement a wide array of similarity metrics for different types of objects, facilitating the execution of record linkage experiments.

2.3 Machine Learning for Record Linkage
Duplicate detection systems based on machine learning attempt to exploit labelled training data in the form of record pairs that are marked as duplicates or non-duplicates by human editors [5, 6, 7, 15, 16, 19, 20]. Previously proposed systems have explored labelled training examples at two levels, either individually or in combination. These levels are (i) using trainable string metrics, such as learnable edit distances, that adapt textual similarity computations to specific record fields, and (ii) training a classifier to discriminate between pairs of duplicate and non-duplicate records, using similarity values for different fields as features. The work reported in this paper falls into the second category, using binary classifiers that learn to combine different features to discriminate duplicate gazetteer records from non-duplicates.

In the machine learning literature, binary classification is the supervised learning task of classifying the members of a given set of objects (in our case, pairs of gazetteer records) into one of two groups, on the basis of whether they have some features or not. Methods proposed in the literature for learning binary classifiers include decision trees [28, 32] and support vector machines [21].

Decision tree classifiers learn a tree-like model in which the leaves represent classifications and the branches represent the conjunctions of features that lead to those classifications. Decision tree classifiers provide high accuracy and transparency (e.g., a human can easily examine the produced rules), although they can only output binary decisions (i.e., the gazetteer record pairs would either be duplicates or non-duplicates). The Alternating Decision Tree algorithm is a generalization of decision tree classifiers, introduced by Freund and Mason, which in addition to classifications also gives a measure of confidence (i.e. the classification margin) in the result being the correct classification [32].

Support Vector Machines (SVMs) work by determining a hyper-plane that maximizes the total distance between itself and representative data points (i.e., the support vectors), transformed through a kernel function. SVMs can provide a measure of confidence in the result, i.e. an estimate of the probability that the assigned class is the correct one.

This work reports experiments with both alternating decision tree and support vector machine classifiers, through the use of the implementations available in the Weka machine learning framework [4].

2.4 Computational Issues in Record Linkage
In record linkage problems, evaluating all possible pairs of records for duplication is highly inefficient [36]. However, because most of them are clearly dissimilar non-matches, only record pairs that are loosely similar (e.g. that share common tokens) need to be selected as candidates for matching, using blocking [17], canopy clustering [18] or filtering techniques [29, 30, 31]. These three types of techniques share the fact that they explore computationally inexpensive similarity metrics in order to limit the number of comparisons that require the use of the expensive similarity metrics.

Blocking refers to the process of grouping records by some variable which is known a priori to usually be the same for matching pairs (e.g., grouping pairs that belong to the same small geographic region). A particular case of a blocking technique is the Sorted Neighborhood (SN) method [17], which works by sorting the records based on a sorting key and then moving a window of a fixed size sequentially over the sorted records. Records within the window are then paired with each other and included in the candidate record pair list. The use of the window limits the number of possible record pair comparisons for each record to 2w − 1, where w is the size of the window.

Canopy Clustering separates records into overlapping clusters (i.e., canopies) of potential duplicates [18]. A canopy cluster is formed by choosing a record at random from a candidate set of records and then putting in its cluster all the records within a certain threshold distance of it. Pairs of records that fall in the same cluster become candidates for a full similarity comparison against the other records in that cluster.

Filtering works by estimating upper bounds for the similarity scores involved in the detection of duplicates, using these upper bounds to remove dissimilar pairs from the set of candidates. The All-Pairs algorithm combines several filtering techniques (e.g. string size and prefix equality) to reduce the candidate set size, and was demonstrated to be highly efficient and scalable to tens of millions of records [30]. The positional filtering principle exploits the ordering of tokens, leading to upper bound estimates of similarity scores [29].
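A minimal sketch of the Sorted Neighborhood idea described above, where the sorting key and the window size are illustrative parameters:

```python
def sorted_neighborhood_pairs(records, key, w):
    # Sort by a blocking key, then slide a window of size w over the
    # sorted list: each record is paired only with the records that
    # fall inside the same window, instead of with every other record.
    ordered = sorted(records, key=key)
    pairs = []
    for i in range(len(ordered)):
        for j in range(i + 1, min(i + w, len(ordered))):
            pairs.append((ordered[i], ordered[j]))
    return pairs
```

Since each record is only paired with the records at a distance of less than w in the sorted order, the number of candidate pairs grows linearly with the number of records, rather than quadratically as in the naive all-pairs approach.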
2.5 Gazetteer Record Linkage
Previous works have defined the problem of Geospatial Entity Resolution as the process of deriving, from a collection of database sources referring to locations, a single consolidated collection of true locations [2]. The problem differs from other record linkage scenarios mainly due to the presence of a continuous spatial component in geospatial data.

Unlike place names, which are often associated with problems of ambiguity (e.g., different places may share the same name), geospatial footprints provide an unambiguous form of geo-referencing. Record linkage should therefore be much simpler in the case of geospatial data, and indeed commercial GIS tools use spatial coordinates to join location references, through methods such as the one-sided nearest join, in which location references l1 ∈ A and l2 ∈ B are linked if l2 is the closest to l1 among all locations in B. However, in practice, data for the spatial domain is often noisy and imprecise. Different organizations often record geospatial footprints using different scales, accuracies, resolutions and structures [2]. For example, one organization might represent features using only centroid coordinates, while another may use detailed polygonal geometries.

Beeri et al. used geospatial footprints to find matches between datasets, addressing the asymmetry in the definition of the one-sided nearest join [33]. Their results showed that entity resolution is non-trivial even when using only the less ambiguous geospatial information. The use of non-spatial information may improve performance by identifying matching locations which would otherwise be rejected by a spatial-only approach. Several previous works have proposed to combine both spatial and non-spatial features, although this presents non-trivial problems due to the need of combining semantically distinct similarity metrics.

One way to combine different similarity metrics is to put a threshold on one, then use another metric as a secondary filter (i.e. helping in the rejection of locations that are similar according to the first metric but are not duplicates), and so on [1, 9]. Hastings described a gazetteer record linkage approach based on human geocognition, focusing first on geospatial footprints (since overlapping and/or near places can be the same), second on type categories (conceptually near place types can indicate the same place), and finally on names (place names with small variations in spelling, or with abbreviations, elisions or transliterations, can indicate the same place) [9]. However, this approach does not capture matches that are not highly similar according to each of the individual similarity metrics. In such cases, one needs a single similarity measure which combines all the individual metrics into a single score.

Previous approaches have proposed to use an overall similarity between pairs of features, computed by taking a weighted average of the similarities between their individual attributes [12]. Weighted averages have the flexibility of giving some attributes more importance than others. However, manually tuning the individual weights can be difficult, and machine learning methods offer a more robust approach.

Sehgal et al. also worked with overall similarity metrics combining footprints (i.e. distance between centroids), place types, and place names (i.e., Levenshtein distance), but they explored data-driven approaches using machine learning [2]. This is the work most similar to the research reported in this paper. Sehgal et al. mentioned that, for future work, they would be interested in exploring more semantic information and experimenting with more sophisticated similarity measures. The work reported here goes in this direction, by experimenting with many different similarity metrics computed over a large set of gazetteer record features.

3. THE PROPOSED APPROACH
Classifying pairs of gazetteer records as either duplicates or non-duplicates, according to their similarity, is a binary, but nonetheless hard, classification problem. Instead of applying a standard record linkage approach based on string similarity metrics, this paper argues for the use of specific similarity features better suited to the context of gazetteer records. Before detailing the considered features, this section formalizes the considered notion of gazetteer record and presents some examples that illustrate the difficulties in detecting duplicate records.

The general application scenario for the technique reported in this paper is one where we have two gazetteer record datasets A and B developed by independent sources. Each gazetteer record corresponds to some real-world geographic place on the surface of the Earth. A geographic place is defined by (i) one or more place names, by which it is commonly known and communicated, (ii) one or more place types, situating it in an agreed classification scheme that also provides the conceptual basis for it, (iii) zero, one or more geospatial footprints, corresponding to geo-referenced geometries locating it on the Earth's surface and optionally designating its areal configuration, and (iv) zero, one or more temporal footprints, designating the intervals of validity for the place. Each geospatial footprint may be given as a point, a line, a bounding rectangle, or another polygonal shape that is supported in the Geography Markup Language (GML) standard. Temporal footprints are given as calendar instants or calendar intervals, also according to the GML standard.

Often, a place is given a multiplicity of place names, place types and footprints by different people and groups, for their purposes, over time. However, within a particular individual gazetteer, a real-world place and its gazetteer record should stand in a one-to-one relationship. Thus, the construction of new gazetteers from multiple data sources requires the handling of duplicate gazetteer records, in this paper referred to as the gazetteer record linkage problem. The objective of gazetteer record linkage is therefore to find pairs of gazetteer records <r1, r2>, such that r1 ∈ A and r2 ∈ B and both r1 and r2 correspond to the same real-world place.

It should be noticed that place names frequently embed place type information (e.g., Rua Cidade de Roma or Luxembourg City) and some can even be misleading (e.g., Rio Grande do Sul, which names a populated place as well as a water body). Different places may share the same or similar centroid coordinates (e.g., Monaco the city or the state), and places can change boundaries (e.g., Berlin before and after the wall), place types (e.g., Rio de Janeiro was the national capital of Brazil before 1960) or place names (e.g., Congo was called Zaire between 1971 and 1997) over time. The same regions of the globe can also correspond to different places over time (e.g., the former Union of Soviet Socialist Republics).

3.1 Gazetteer Similarity Features
The feature vectors used in the proposed record linkage scheme combine information from several different metadata elements available in the gazetteer records. The considered features can be grouped in five classes, namely (i) place name similarity, (ii) geospatial footprint similarity, (iii) place type similarity, (iv) semantic relationship similarity, and (v) temporal footprint similarity.
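To make the idea concrete, the assembly of such a feature vector can be sketched as follows; the record fields and the particular scores are simplified illustrations, not the actual ADL protocol elements or the full feature set used in the experiments.

```python
import math

def set_sim(x, y):
    # Jaccard coefficient between two sets; pairs of empty sets are
    # treated here as having no overlap (an assumed convention).
    return len(x & y) / len(x | y) if x | y else 0.0

def feature_vector(rec1, rec2):
    # Each record is a dict with a main name, a set of alternative
    # names, a set of place type classes and a centroid point.
    name_sim = 1.0 if rec1["name"] == rec2["name"] else 0.0
    alt_sim = set_sim(rec1["alt_names"], rec2["alt_names"])
    type_sim = set_sim(rec1["types"], rec2["types"])
    dx = rec1["centroid"][0] - rec2["centroid"][0]
    dy = rec1["centroid"][1] - rec2["centroid"][1]
    geo_sim = math.exp(-(dx * dx + dy * dy))  # 1.0 when the centroids coincide
    return [name_sim, alt_sim, type_sim, geo_sim]
```

A classifier is then trained over such vectors, computed for record pairs labelled as duplicates or non-duplicates.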
In terms of the place name similarity features, the idea was to capture the intuition that place names that are sufficiently similar support the assessment that a common place is being described. The similarity calculations should make wide allowances for spellings, abbreviations and elisions, transliterations, etc. Different types of textual similarity metrics have complementary strengths and weaknesses. Consequently, it is useful to consider multiple similarity metrics when evaluating potential duplicates. The following features were considered in the experiments:

• The Levenshtein distance between the main place names associated with the gazetteer records.

• The Jaro-Winkler distance between the main place names associated with the gazetteer records.

• The Monge-Elkan distance between the main place names associated with the gazetteer records.

• The Double Metaphone distance between the main place names associated with the gazetteer records. This metric addresses the case in which place names are slightly misspelled.

• The Jaccard coefficient between the sets of alternative names associated with the gazetteer records.

• The Overlap coefficient between the sets of alternative names associated with the gazetteer records.

• The minimum and maximum Levenshtein distances between the names given in the sets of alternative names associated with the gazetteer records.

• The minimum and maximum Jaro-Winkler distances between the names given in the sets of alternative names associated with the gazetteer records.

In terms of the geospatial footprint similarity, the idea was to support the intuition that places whose locations are not close cannot be the same. The following features were considered in the experiments:

• The two areas that cover the geospatial footprints associated with the pair of gazetteer records.

• The minimum distance between the geospatial footprints associated with the gazetteer records.

• The distance between the centroid points of the areas that cover the geospatial footprints associated with the records.

• The normalized distance between the centroid points of the areas that cover the geospatial footprints associated with the records, given by similarity = e^(−distance²).

• The area of overlap between the geospatial footprints associated with the gazetteer records.

• The relative area of overlap between the geospatial footprints associated with the gazetteer records, given by Equation 1, originally proposed by Greg Janée³:

  similarity(F1, F2) = area(F1 ∩ F2) / area(F1 ∪ F2)    (1)

³ http://www.alexandria.ucsb.edu/~gjanee/archive/2003/similarity.html

In terms of the place type similarity, the intuition is that places having the same type are more likely to be duplicates. Given the hierarchical nature of the Alexandria FTT, we can also have a sense of proximity between place types (i.e., the place types cities and national capitals are conceptually similar and possible descriptors for the same place). The considered set of features is as follows:

• The Jaccard coefficient between the sets of place type classes associated with the gazetteer features.

• The Overlap coefficient between the sets of place type classes associated with the gazetteer features.

• The equality of the main place type classes associated with the gazetteer features (i.e., one if they are equal and zero otherwise).

• The semantic similarity metric proposed by Lin, computed over the FTT nodes corresponding to the main place type class associated with the gazetteer features.

• The semantic similarity metric proposed by Resnik, computed over the FTT nodes corresponding to the main place type class associated with the features.

• The count of the up-steps (to broader terms) necessary to bring the place types to the lowest (narrowest) term that subsumes them both, computed over the FTT nodes corresponding to the main place type class associated with the gazetteer features.

In terms of the similarity measures corresponding to the semantic relations, the intuition is that places related to the same other places are more likely to be duplicates. The considered set of features is as follows:

• The Jaccard coefficient for the sets of related features associated with the gazetteer records, for each of the relationship types supported in the ADL gazetteer protocol (i.e., part of, state of, county of, country of, ...).

• The Overlap coefficient for the sets of related features associated with the gazetteer records, for each of the relationship types supported in the ADL gazetteer protocol (i.e., part of, state of, county of, country of, ...).
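The set coefficients used across the name, type and relationship feature classes can be sketched as follows; the handling of empty sets is a convention assumed here, as the choice is not specified in the paper. Dice's coefficient, discussed in Section 2.2, is included for comparison.

```python
def jaccard(x, y):
    # |X ∩ Y| / |X ∪ Y|: equals 1.0 only when the two sets coincide.
    return len(x & y) / len(x | y) if x | y else 1.0

def overlap(x, y):
    # |X ∩ Y| / min(|X|, |Y|): equals 1.0 whenever one set contains the other.
    return len(x & y) / min(len(x), len(y)) if x and y else 0.0

def dice(x, y):
    # 2 |X ∩ Y| / (|X| + |Y|)
    return 2 * len(x & y) / (len(x) + len(y)) if x or y else 1.0
```

The Overlap coefficient is the more permissive of the two: a record listing only a subset of another record's relationships still scores 1.0, which suits gazetteer sources with uneven levels of detail.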
Finally, in terms of the temporal similarity, the intuition is that places from the same temporal period are more likely to be similar. The following features were considered:

• The overlap between the time periods associated with the gazetteer records.

• The difference between the values for the center dates associated with the gazetteer records.

It should nonetheless be noted that few of the gazetteer records considered in the experiments actually have a defined temporal period for their validity. The impact of these features on the obtained results is therefore negligible.

In what concerns the implementation of the feature extraction stage, and as previously stated, all the gazetteer records used in the experiments were encoded in the XML format of the Alexandria Digital Library gazetteer protocol. Small changes were introduced in the ADL protocol XML Schema in order to support the association of gazetteer features with temporal periods (i.e., according to the ADL specifications, temporal information is originally only considered in the full gazetteer content standard).

The loading and parsing of the Alexandria Digital Library gazetteer records was made through the use of the Apache XMLBeans⁴ framework. The required geospatial computations were implemented through the use of the Java Topology Suite (JTS), an API of 2D spatial predicates and functions that supports the computation of area aggregates and area intersections [8]. The geospatial footprints are brought to a common datum and projection prior to record linkage, also through the use of JTS. The similarity metrics came from the SimPack package of Java similarity functions [8], except for the DoubleMetaphone distance, for which a new implementation was made.

3.2 The Classification Methods
This paper reports experiments with two different types of classifiers, namely support vector machines (SVMs) and alternating decision tree classifiers. SVMs, based on linear or nonlinear regression, represent the state-of-the-art in classification technology. They also offer the possibility of assigning a value in the interval [0,1] that estimates the probability of a given record pair being a duplicate. These confidence estimates can be viewed as an overall measure of similarity between the gazetteer records comprising the pair. SVMs exhibit remarkable resistance to noise, handle correlated features well, and rely only on the most informative training examples, which leads to a larger independence from the relative sizes of the sets of positive and negative examples. Alternating decision tree classifiers can provide more interpretable results (i.e., the decision tree), while at the same time also assigning a measure of confidence (i.e., the classification margin) to the results.

the training data better, although it can also result in over-fitting the model to the training data.

The feature vectors experimented with both classification algorithms correspond to different combinations of the features described in the previous section. Some experiments were also made with feature selection approaches, for instance measuring the Information Gain statistic associated with the individual features. The Weka machine learning framework provides the implementations for the classification and feature selection algorithms.

4. EXPERIMENTAL EVALUATION
This section describes the datasets, the metrics, and the results of the experimental evaluation of the proposed approach. The section ends with an analysis of filtering techniques for scaling the proposed approach to large sets of gazetteer records.

4.1 The Test Collection
Evaluating the accuracy of duplicate detection requires a gold-standard dataset in which all duplicate records have been identified. The experiments reported in this paper used a set of gazetteer records containing both duplicate and non-duplicate examples, with the records encoded according to the XML schema of the ADL gazetteer service. The entire set contains 1,257 records describing places from all around the globe. A total of 1,927 record pairs have been manually annotated as duplicates. An analysis of the duplicate cases revealed that different situations occur in the dataset, including place names with different spellings (e.g., Wien and Vienna), records encoded with different geospatial footprints (e.g. either through centroid coordinates, bounding rectangles or complex polygonal geometries) or place types (e.g., Lisbon is both a city and a populated place), and records containing more or less detailed information concerning relationships to other place names.

The naive approach of using all possible record pairs described in the dataset (in our case 790,653 record pairs, with only 1,927 being duplicates) results in a training set containing many more negative non-duplicate examples than duplicate examples. This skew not only impacts the usefulness of a classification approach (e.g., the accuracy would remain high even if we classified all instances as non-duplicates), but also makes the learning process highly inefficient. Taking inspiration from the approaches proposed by Sehgal et al. [2] and by Bilenko and Mooney [7], the following algorithm was taken to select the pairs to be used in the test collection:
a confidence value n the interval [0,1]. 1. All the pairs annotated as duplicates are initially added
to the test collection.
Each of these classifiers models the problem with different
levels of complexity. For instance, the alternating decision 2. Assuming that initially we have a set of size n with
tree classifier tries to define a function that logically parti- the pairs annotated as duplicates, randomly select n/2
tions the classification space in terms of a tree of decisions pairs annotated as non-duplicates and add them to the
made over attributes of the original data, whereas SVMs test collection.
use a kernel function to flexibly map the original data into
3. The remaining pairs annotated as non-duplicates are
a higher-dimensional space where a separating hyper-plane
sorted according to different similarity metrics, namely
can be defined. A more complex function, such as the SVM
the Levenshtein distance on the place names, the cen-
approach with a non-linear kernel, may be able to model
troid distance, the class overlap and the feature type
4
http://xmlbeans.apache.org/ distance.
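The selection procedure above can be sketched as follows. This is a minimal illustration, not the paper's actual code: pairs are represented by a generic type, and each similarity metric is assumed to be expressible as a Comparator that orders non-duplicate pairs from most to least similar.

```java
import java.util.*;

public class PairSelection {

    // Sketch of the four-step test-collection selection procedure.
    static <P> List<P> select(List<P> duplicates, List<P> nonDuplicates,
                              List<Comparator<P>> metrics, Random rnd) {
        // Step 1: all annotated duplicates enter the test collection.
        List<P> collection = new ArrayList<>(duplicates);
        int half = duplicates.size() / 2;

        // Step 2: add n/2 randomly chosen non-duplicates.
        List<P> remaining = new ArrayList<>(nonDuplicates);
        Collections.shuffle(remaining, rnd);
        collection.addAll(new ArrayList<>(remaining.subList(0, half)));
        remaining = new ArrayList<>(remaining.subList(half, remaining.size()));

        // Step 3: sort the remaining non-duplicates by each similarity metric,
        // producing one ordered list per metric (most similar first).
        List<List<P>> sorted = new ArrayList<>();
        for (Comparator<P> metric : metrics) {
            List<P> copy = new ArrayList<>(remaining);
            copy.sort(metric);
            sorted.add(copy);
        }

        // Step 4: take n/2 hard negatives round-robin from the sorted lists,
        // skipping pairs already picked from another list.
        Set<P> hard = new LinkedHashSet<>();
        for (int i = 0; hard.size() < half; i++) {
            hard.add(sorted.get(i % sorted.size()).get(i / sorted.size()));
        }
        collection.addAll(hard);
        return collection;
    }
}
```

The round-robin step deliberately draws the most similar non-duplicates, which, as noted above, yields harder and therefore more informative negative training examples than random sampling.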
Figure 1: Histograms with the frequency distribution for different similarity scores (Levenshtein similarity, Jaro-Winkler similarity, normalized distance and relative overlap; each histogram plots the number of record pairs per similarity score).

The procedure above results in a test collection containing the same number of duplicate and non-duplicate pairs, and it also includes a balance between easy (i.e., highly dissimilar records) and difficult cases (i.e., similar records) for the classifier to decide, the latter being more informative for training than randomly selected record pairs. The total number of record pairs used in the experiments is therefore 3,853. Table 1 presents a statistical characterization of the test collection, and the histograms in Figure 1 show the distribution of the similarity scores for the Levenshtein similarity, the Jaro-Winkler metric, the normalized centroid distance and the relative area of overlap.

Number of records                       1,257
Number of place names                   1,753
Number of unique place names            1,114
Average names per record                 1.39
Number of considered place types            8
Average number of types per record       1.11
Records with temporal data                 11
Records with only centroid coords.        240
Total number of pairs                 790,653
Total number of duplicates              1,927
Considered number of pairs              3,853
Considered number of duplicates         1,927

Table 1: Characterization of the test collection.

4.2 The Evaluation Metrics
A variety of experimental methodologies have been used to evaluate the accuracy of record linkage approaches. Bilenko and Mooney advocate that standard information retrieval evaluation metrics, namely precision-recall curves, provide the most informative evaluation methodology [7]. In this work, precision and recall were computed in terms of classifying the duplicate pairs. Precision is the ratio of the number of items correctly assigned to the class divided by the total number of items assigned to the class. Recall is the ratio of the number of items correctly assigned to a class as compared with the total number of items in the class. Since precision can be increased at the expense of recall, the F-measure (with equal weights) was also computed to combine precision and recall into a single number. Given the binary nature of the classification problem, results were also measured in terms of accuracy and error. Accuracy is the proportion of correct results (both true positives and true negatives) given by the classifier. Error, on the other hand, measures the proportion of instances incorrectly classified, considering false positives plus false negatives.

4.3 The Obtained Results
SVM and alternating decision tree classifiers were trained using different combinations of the proposed features. The considered feature combinations are as follows:

1. Use only the place name similarity features.

2. Use only the geospatial footprint similarity features.

3. Use the place name similarity features plus the geospatial footprint similarity features.

4. Use the entire set of proposed similarity features.

In each of the above cases, a 10-fold cross validation was performed. Table 2 overviews the results obtained for each of the different combinations.

The results show that the alternating decision tree classifiers consistently outperform the classifiers based on support vector machines. A decision tree classifier that used all the proposed similarity features achieved the top performance (i.e., an accuracy of 98.313), although the usage of place name similarity alone provides a very competitive baseline (i.e., an accuracy of 97.9496). The top performing classifier corresponds to a decision tree with 41 nodes and 21 leaves, using 9 different features. The root of the decision tree tests whether the overlap coefficient between the sets of place names is greater than zero.

Using the geospatial footprint similarity metrics alone results in an accuracy of 94.6535, and the combination of the name plus the footprint similarities results in an accuracy of 98.2611. The remaining types of similarity features (e.g., place type similarity) seem to have a limited impact on the final results.

It is important to notice that, though using place name similarity alone results in an accuracy that is close to the best reported combination of features, some caveats should be attached to this conclusion. The collection of gazetteer records that was used in the experiments only contains relatively high-level places (e.g., cities). Since place name ambiguity is likely to manifest itself more over fine-grained places (e.g., small villages or street names), the performance of place name similarity should drop in the case of gazetteer records containing fine-grained information.
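The overlap coefficient tested at the root of the best-performing decision tree is a simple set-based measure over the two records' place-name sets. A minimal sketch (the method name is illustrative, not taken from the paper's implementation):

```java
import java.util.*;

public class NameOverlap {

    // Overlap coefficient between two sets of place names:
    // |A ∩ B| / min(|A|, |B|), defined as 0 when either set is empty.
    static double overlapCoefficient(Set<String> a, Set<String> b) {
        if (a.isEmpty() || b.isEmpty()) return 0.0;
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        return intersection.size() / (double) Math.min(a.size(), b.size());
    }
}
```

A record pair whose name sets share no element at all scores zero, which is exactly the condition the learned tree checks first before consulting any other feature.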
                      Support Vector Machines                 Alternating Decision Trees
                   Precision Recall   F1   Accuracy  Error    Precision Recall   F1   Accuracy  Error
Place names          0.993   0.954  0.973  97.3787   2.6213    0.988    0.971  0.979  97.9496   2.0504
Footprints           0.797   0.941  0.863  85.0506  14.9494    0.944    0.950  0.947  94.6535   5.3465
Names + footprints   0.992   0.958  0.975  97.5603   2.4397    0.989    0.976  0.983  98.2611   1.7389
All                  0.992   0.962  0.977  97.6901   2.3099    0.987    0.979  0.983  98.3130   1.6870

Table 2: Performance comparison between classification algorithms and different feature vectors.
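As a quick sanity check on Table 2, the equal-weight F-measure is simply the harmonic mean of precision and recall, and applying it to the reported precision and recall values reproduces the F1 column:

```java
public class F1Check {

    // Equal-weight F-measure (F1): the harmonic mean of precision and recall.
    static double f1(double precision, double recall) {
        return 2.0 * precision * recall / (precision + recall);
    }
}
```

For instance, the place-name SVM row gives f1(0.993, 0.954) ≈ 0.973, and the all-features decision tree row gives f1(0.987, 0.979) ≈ 0.983, matching the table.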

Figure 2: Effect of pre-filtering pairs based on simple similarity scores (Levenshtein, Jaro-Winkler and Monge-Elkan similarity, normalized distance and relative overlap; each chart plots the percentage of missed duplicates and of tested pairs per similarity threshold).

Using the information gain statistic, a specific test examined which of the proposed features are the most informative. For each of the four scenarios considered in Table 2, Table 3 lists the top five most informative features. The results show that indeed place name similarity provides the most informative features in discriminating between duplicate and non-duplicate gazetteer records.

1 - Names:            Jaccard Names, Overlap Names, Levenshtein Primary, JaroWinkler Primary, MongeElkan Primary
2 - Footprints:       Centroid Distance, Distance, Normalized Distance, Relative Overlap, Overlap
3 - Names+Footprints: Jaccard Names, Overlap Names, Levenshtein Primary, JaroWinkler Primary, Centroid Distance
4 - All:              Jaccard Names, Overlap Names, Levenshtein Primary, JaroWinkler Primary, Centroid Distance

Table 3: Top five most informative features.

Feature selection has a long history within the field of machine learning. Through appropriate feature selection, one can often build classifiers that are both more efficient (i.e., using fewer features) and more accurate. A specific experiment addressed the use of a greedy forward feature selection method. Greedy forward selection begins with a set of pre-selected features S, which is typically initialized as empty. For each feature fi not yet in S, a model is trained and evaluated using the feature set S ∪ {fi}. The feature that provides the largest performance gain is added to S, and the process is repeated until no single feature improves performance. An alternating decision tree classifier was built through this procedure. The results obtained with this classifier correspond to an accuracy of 97.4306, considering a total of eight features. Of these eight features, two are related to the similarity between spatial footprints and one concerns place types. The remaining features correspond to place name similarity metrics.

A last experiment examined how the proposed categorization approach can scale to large collections, through the usage of pre-filtering techniques. The idea is that, instead of comparing all pairs, we can pre-filter the pairs to be compared according to individual similarity scores produced by highly informative and computationally inexpensive approaches. The charts in Figure 2 show, for five different pre-filtering techniques, how many duplicate records would be missed if we only compared record pairs having a similarity score above a given threshold.

The results show that both the Levenshtein and Jaro-Winkler metrics, computed over the primary place names associated with the gazetteer records, provide good pre-filtering approaches. For instance, if we consider only the record pairs where the Levenshtein similarity is above 0.9, automatically classifying all other pairs as non-duplicates, we would only be missing around 15.4 percent of the full set of duplicate pairs. At the same time, the number of pairs to compare through the computationally more expensive procedure involving the use of the full set of similarity features would be reduced to around 41.8 percent of the entire set of record pairs. Computing thresholds on the normalized distance and on the relative overlap is also relatively inexpensive, particularly if one considers appropriate indexing mechanisms. However, these metrics provide much worse pre-filtering heuristics.
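The Levenshtein-based pre-filter described above can be sketched as follows. This is a minimal illustration: the 0.9 threshold mirrors the example in the text, while the normalization of the edit distance by the maximum string length is an assumption, since the exact normalization used is not stated.

```java
public class LevenshteinFilter {

    // Classic dynamic-programming Levenshtein edit distance,
    // using two rolling rows to keep memory linear in |t|.
    static int distance(String s, String t) {
        int[] prev = new int[t.length() + 1];
        int[] curr = new int[t.length() + 1];
        for (int j = 0; j <= t.length(); j++) prev[j] = j;
        for (int i = 1; i <= s.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= t.length(); j++) {
                int cost = s.charAt(i - 1) == t.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[t.length()];
    }

    // Normalized similarity in [0,1]; pairs below the threshold are
    // classified as non-duplicates without computing the full feature set.
    static boolean passesFilter(String a, String b, double threshold) {
        int maxLen = Math.max(a.length(), b.length());
        if (maxLen == 0) return true;
        double similarity = 1.0 - distance(a, b) / (double) maxLen;
        return similarity >= threshold;
    }
}
```

With a threshold of 0.9, a pair like "Lisboa"/"Lisbon" (one substitution over six characters, similarity about 0.83) would be discarded before the expensive classification step, illustrating why aggressive thresholds trade recall for speed.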
5. CONCLUSIONS AND FUTURE WORK
This paper presented a machine learning approach, more specifically a supervised classification approach, for gazetteer record linkage. It reported a thorough evaluation of two different classification approaches (i.e., SVMs and alternating decision tree classifiers) using feature vectors that combine different aspects of similarity between pairs of gazetteer records. These aspects are (i) place name similarity, (ii) geospatial footprint similarity, (iii) place type similarity, (iv) semantic relationships similarity, and (v) temporal footprint similarity. Both SVMs and alternating decision tree classifiers are adequate for the task, with alternating decision trees performing slightly better. The usage of all the different types of similarity features leads to an increase in accuracy, although the similarity between place names is the most informative feature.

Despite the promising results, there are also many challenges for future work. Previous studies have acknowledged that entity resolution is further complicated by the fact that the different data sources may use different vocabularies for describing the location types, motivating the use of semantic mapping for cross-walking between the classification schemes [2]. In this work, it was assumed that the gazetteer records were all using the classification scheme of the ADL feature type thesaurus, and the experiments were limited to using similarity metrics between nodes in this thesaurus. An interesting direction for future work would be to explore the use of a similarity metric that accounted for the semantic differences between the feature types. It should nonetheless be noted that feature type similarity scores were not among the most informative features, and it seems reasonable to assume that it is possible to accurately identify duplicates even without a common classification scheme.

Previous works have also noted that the standard textual similarity metrics are not well suited to place names because, in everyday usage, their stylistic variability is too great [1]. Also, at the token level, certain words can be very informative when comparing two strings for equivalence, while others are ignorable. For example, ignoring the substring street may be acceptable when comparing address names, but not when comparing names of people (e.g., James Street). Advanced string similarity metrics specific to geographic names have been proposed in the past [1, 34] and it would be interesting to integrate these metrics, as additional features, in the proposed machine learning framework.

When detecting duplicates, besides the information that is available in the pairs of gazetteer records being compared, information available on related gazetteer records may also prove useful. For instance, in two disparate sources, duplicate places may be described through records with completely different names. However, the majority of the gazetteer records associated with the two will share many names in common. Similarly to Samal et al. [12], future work could address the extension of the pairwise similarity between features in order to consider contextual information given by related features with a high semantic (i.e., associated through part-of relationships) or geographic proximity (i.e., through centroid distance). It should nonetheless be noted that even the relatively simple metrics used in the experiments reported here already provide very accurate results.

Finally, previous works have also addressed semi-automated gazetteer record linkage, through a user interface specifically designed for the task [3]. Currently ongoing work also goes in this direction, by studying user interfaces for the management of gazetteer information in which human editors are asked to assess the results of automated approaches for gazetteer record linkage, and are later asked how to perform record fusion. Fusion is particularly hard, since it has to deal with problems of inconsistency, redundancy, ambiguity, and conflict of information in the collection. The overall objective is to have redundancies eliminated, accurate data retained and data conflicts reconciled, in building useful gazetteer datasets resulting from the fusion of multiple heterogeneous sources.

Acknowledgments
Removed in order to preserve the anonymity of authorship.

6. REFERENCES
[1] J. Hastings, and L. Hill (2002) Treatment of Duplicates in the Alexandria Digital Library Gazetteer. In Proceedings of the 2002 GeoScience Conference.
[2] V. Sehgal, L. Getoor, and P. D. Viechnicki (2006) Entity Resolution in Geospatial Data Integration. In Proceedings of the 14th International Symposium on Advances in Geographical Information Systems.
[3] H. Kang, V. Sehgal, and L. Getoor (2007) GeoDDupe: A Novel Interface for Interactive Entity Resolution in Geospatial Data. In Proceedings of the 11th IEEE International Conference on Information Visualisation.
[4] I. H. Witten, and E. Frank (2000) Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann.
[5] M. Bilenko, B. Kamath, and R. J. Mooney (2006) Adaptive Blocking: Learning to Scale Up Record Linkage and Clustering. In Proceedings of the 6th IEEE International Conference on Data Mining.
[6] M. Bilenko, and R. J. Mooney (2006) Adaptive Duplicate Detection Using Learnable String Similarity Measures. In Proceedings of the 9th ACM Conference on Knowledge Discovery and Data Mining.
[7] M. Bilenko, and R. J. Mooney (2003) On Evaluation and Training-Set Construction for Duplicate Detection. In Proceedings of the KDD-2003 Workshop on Data Cleaning, Record Linkage, and Object Consolidation.
[8] A. Bernstein, E. Kaufmann, C. Kiefer, and C. Bürki (2005) SimPack: A Generic Java Library for Similarity Measures in Ontologies (Working Paper). Department of Informatics, University of Zurich.
[9] J. T. Hastings (2008) Automated conflation of digital gazetteer data. International Journal of Geographical Information Science, 22(10).
[10] L. L. Hill (2000) Core Elements of Digital Gazetteers: Placenames, Categories, and Footprints. In Proceedings of the 4th European Conference on Research and Advanced Technology for Digital Libraries.
[11] L. L. Hill (2006) Georeferencing: The Geographic Associations of Information. The MIT Press.
[12] A. Samal, S. Seth, and K. Cueto (2004) A feature-based approach to conflation of geospatial sources. International Journal of Geographical Information Science, 18.
[13] A. E. Monge, and C. Elkan (1996) The field matching problem: Algorithms and applications. In Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining.
[14] H. Pasula, B. Marthi, B. Milch, S. Russell, and I. Shpitser (2003) Identity uncertainty and citation matching. Advances in Neural Information Processing Systems, 15.
[15] S. Sarawagi, and A. Bhamidipaty (2002) Interactive deduplication using active learning. In Proceedings of the 8th ACM Conference on Knowledge Discovery and Data Mining.
[16] W. E. Winkler (2002) Methods for record linkage and Bayesian networks. Technical report, Statistical Research Division, U.S. Census Bureau.
[17] M. A. Hernandez, and S. J. Stolfo (1995) The merge/purge problem for large databases. In Proceedings of the 1995 ACM Conference on Management of Data.
[18] A. K. McCallum, K. Nigam, and L. Ungar (2000) Efficient clustering of high-dimensional datasets with application to reference matching. In Proceedings of the 6th ACM Conference on Knowledge Discovery and Data Mining.
[19] W. W. Cohen, and J. Richman (2002) Learning to match and cluster large high-dimensional datasets for data integration. In Proceedings of the 8th ACM Conference on Knowledge Discovery and Data Mining.
[20] S. Tejada, C. A. Knoblock, and S. Minton (2002) Learning domain-independent string transformation weights for high accuracy object identification. In Proceedings of the 8th ACM Conference on Knowledge Discovery and Data Mining.
[21] T. Joachims (1999) Making large-scale SVM learning practical. In B. Scholkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods - Support Vector Learning, MIT Press.
[22] W. E. Winkler (2006) Overview of Record Linkage and Current Research Directions. Technical report, Statistical Research Division, U.S. Census Bureau.
[23] W. Cohen, P. Ravikumar, and S. Fienberg (2003) A comparison of string distance metrics for name-matching tasks. In Proceedings of the 9th ACM Conference on Knowledge Discovery and Data Mining.
[24] L. Philips (2000) The Double Metaphone Search Algorithm. C/C++ Users Journal.
[25] D. Lin (1998) An information-theoretic definition of similarity. In Proceedings of the 15th International Conference on Machine Learning.
[26] P. Resnik (1999) Semantic similarity in a taxonomy: An information-based measure and its application to problems of ambiguity in natural language. Journal of Artificial Intelligence Research, 11.
[27] P. Schwarz, Y. Deng, and J. E. Rice (2006) Finding Similar Objects Using a Taxonomy: A Pragmatic Approach. In Proceedings of the 5th International Conference on Ontologies, Databases and Applications of Semantics.
[28] S. R. Safavian, and D. Landgrebe (1991) A survey of decision tree classifier methodology. IEEE Transactions on Systems, Man and Cybernetics, 21(3).
[29] C. Xiao, W. Wang, X. Lin, and J. X. Yu (2008) Efficient similarity joins for near duplicate detection. In Proceedings of the 17th International Conference on World Wide Web.
[30] R. J. Bayardo, Y. Ma, and R. Srikant (2007) Scaling up all pairs similarity search. In Proceedings of the 16th International Conference on World Wide Web.
[31] A. Arasu, V. Ganti, and R. Kaushik (2006) Efficient exact set-similarity joins. In Proceedings of the 32nd International Conference on Very Large Data Bases.
[32] Y. Freund, and L. Mason (1999) The alternating decision tree learning algorithm. In Proceedings of the 16th International Conference on Machine Learning.
[33] C. Beeri, Y. Kanza, E. Safra, and Y. Sagiv (2004) Object fusion in geographic information systems. In Proceedings of the 30th International Conference on Very Large Data Bases.
[34] C. Davis, and E. Salles (2007) Approximate String Matching for Geographic Names and Personal Names. In Proceedings of the 9th Brazilian Symposium on GeoInformatics.
[35] V. I. Levenshtein (1966) Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady, 10.
[36] A. K. Elmagarmid, P. G. Ipeirotis, and V. S. Verykios (2007) Duplicate Record Detection: A Survey. IEEE Transactions on Knowledge and Data Engineering, 19(1).