Streaming Cross Document Entity Coreference Resolution

Streaming Cross Document Entity Coreference Resolution
Delip Rao and Paul McNamee and Mark Dredze

Human Language Technology Center of Excellence
Center for Language and Speech Processing
Johns Hopkins University
delip,mcnamee,mdredze@jhu.edu
Abstract (“Barack Obama”) followed by additional expres-
sions for the same entity (“President Obama.”)
Previous research in cross-document en- An intra-document coreference system must first
tity coreference has generally been re- identify each reference, often relying on named
stricted to the offline scenario where the entity recognition, and then decide if these refer-
set of documents is provided in advance. ences refer to a single individual or multiple enti-
As a consequence, the dominant approach ties, creating a coreference chain for each unique
is based on greedy agglomerative cluster- entity. Feature representations include surface
ing techniques that utilize pairwise vec- form similarity, lexical context of mentions, po-
tor comparisons and thus require O(n2 ) sition in the document and distance between ref-
space and time. In this paper we ex- erences. A variety of statistical learning meth-
plore identifying coreferent entity men- ods have been applied to this problem, including
tions across documents in high-volume use of decision trees (Soon et al., 2001; Ng and
streaming text, including methods for uti- Cardie, 2002), graph partitioning (Nicolae and
lizing orthographic and contextual infor- Nicolae, 2006), maximum-entropy models (Luo
mation. We test our methods using several et al., 2004), and conditional random fields (Choi
corpora to quantitatively measure both the and Cardie, 2007).
efficacy and scalability of our streaming Given pre-processed documents, in which enti-
approach. We show that our approach ties have been identified and entity mentions have
scales to at least an order of magnitude been linked into chains, we seek to identify across
larger data than previous reported meth- an entire document collection all chains that re-
ods. fer to the same entity. This task is called cross
document coreference resolution (CDCR). Sev-
1 Introduction
eral of the challenges associated with CDCR dif-
A key capability for successful information ex- fer from the within document task. For example,
traction, topic detection and tracking, and ques- it is unlikely that the same document will discuss
tion answering is the ability to identify equiva- John Phillips the American football player and
lence classes of entity mentions. An entity is a John Phillips the musician, but it is quite proba-
real-world person, place, organization, or object, ble that documents discussing each will appear in
such as the person who serves as the 44th pres- the same collection. Therefore, while matching
ident of the United States. An entity mention is entities with the same mention string can work
a string which refers to such an entity, such as well for within document coreference, more so-
“Barack Hussein Obama”, “Senator Obama” or phisticated approaches are necessary for the cross
“President Obama”. The goal of coreference res- document scenario where a one-entity-per-name
olution is to identify and connect all textual entity assumption is unreasonable.
mentions that refer to the same entity. One of the most common approaches to both
The first step towards this goal is to identify all within document and cross document corefer-
references within the same document, or within ence resolution has been based on agglomerative
document coreference resolution. A document of- clustering, where vectors might be bag-of-word
ten has a leading canonical reference to the entity contexts (Bagga and Baldwin, 1998; Mann and
Yarowsky, 2003; Gooi and Allan, 2004; Chen theoretic evaluation measure, B-CUBED; intro-
and Martin, 2007). These algorithms creates a duction of a data set based on 197 New York
O(n2 ) dependence in the number of mentions – Times articles which mention a person named
for within document – and documents – for cross John Smith; and, use of TF/IDF weighted vec-
document. This is a reasonable limitation for tors and cosine similarity in single-link greedy ag-
within document, since the number of references glomerative clustering.
will certainly be small; we are unlikely to en- Mann and Yarowsky (2003) extended Bagga
counter a document with millions of references. and Baldwin’s work and contributed several inno-
In contrast to the small n encountered within a vations, including: use of biographical attributes
document, we fully expect to run a CDCR sys- (e.g., year of birth, occupation), and evaluation us-
tem on hundreds of thousands or millions of doc- ing pseudonames. Pseudonames are sets of artifi-
uments. Most previous approaches cannot handle cially conflated names that are used as an efficient
collections of this size. method for producing a set of gold-standard dis-
In this work, we present a new method for ambiguations.1 Mann and Yarowsky used 4 pairs
cross document coreference resolution that scales of conflated names in their evaluation. Their sys-
to very large corpora. Our algorithm operates in tem did not perform as well on named entities with
a streaming setting, in which documents are pro- little available biographic information.
cessed one at a time and only a single time. This Gooi and Allan (2004) expanded on the use
creates a linear (O(n)) dependence on the num- of pseudonames by semi-automatically creating
ber of documents in the collection, allowing us a much larger evaluation set, which they called
to scale to millions of documents and millions the ’Person-X’ corpus. They relied on automated
of unique entities. Our algorithm uses stream- named-entity tagging and domain-focused text re-
ing clustering with common coreference similar- trieval. This data consisted of 34,404 documents
ity computations to achieve large scale. Further- where a single person mention in each document
more, our method is designed to support both was rewritten as ’Person X’. Besides their novel
name disambiguation and name variation. construction of a large-scale resource, they in-
In the next section, we give a survey of related vestigated several minor variations in clustering,
work. In Section 3 we detail our streaming setup, namely (a) use of Kullback-Leibler divergence as
giving a description of the streaming algorithm a distance measure, (b) use of 55-word snippets
and presenting efficient techniques for represent- around entity mentions (vs. entire documents or
ing clusters over streams and for computing simi- extracted sentences), and (c) scoring clusters us-
larity. Section 4 describes the data sets on which ing average-link instead of single- or complete-
we evaluate our methods and presents results. We link.
conclude with a discussion and description of on- Finally, in more recent work, Chen and Martin
going work. (2007) explore the CDCR task in both English and
Chinese. Their work focuses on use of both lo-
2 Related Work cal, and document-level noun-phrases as features
Traditional approaches to cross document coref- in their vector-space representation.
erence resolution have first constructed a vector There have been a number of open evaluations
space representation derived from local (or global) of CDCR systems. For example, the Web People
contexts of entity mentions in documents and then Search (WePS) workshops (Artiles et al., 2008)
performed some form of clustering on these vec- have created a task for disambiguating personal
tors. This is a simple extension of Firth’s distribu- names from HTML pages. A set of ambiguous
tional hypothesis applied to entities (Firth, 1957). names is chosen and each is submitted to a popular
We describe some of the seminal work in this area. web search engine. The top 100 pages are then
Some of the earliest work in CDCR was by manually clustered.We discuss several other data
Bagga and Baldwin (1998). Key contributions 1
See Sanderson (2000) for use of this technique in word
of their research include: promotion of a set- sense disambiguation.
sets in Section 4.2 the system can be examined at any time to produce
All of the papers mentioned above focus on dis- this information based on only the documents that
ambiguating personal names. In contrast, our sys- have been processed thus far.
tem can also handle organizations and locations. In the next sections, we describe both the clus-
Also, as was mentioned earlier, we are commit- tering algorithm and efficient computation of the
ted to a scenario where documents are presented entity similarity scores.
in sequence and entities must be disambiguated
instantly, without the benefit of observing the en- 3.1 Clustering Algorithm
tire corpus. We believe that such a system is bet- We use a streaming clustering algorithm to cre-
ter suited to highly dynamic environments such as ate entity clusters as follows. We observe a set
daily news feeds, blogs, and tweets. Additionally, of points from a potentially infinite set X , one at
a streaming system exposes a set of known entity a time, and would like to maintain a fixed number
clusters after each document is processed instead of clusters while minimizing the maximum cluster
of waiting until the end of the stream. radius, defined as the radius of the smallest ball
containing all points of the cluster. This setup is
3 Approach well known in the theory and information retrieval
Our cross document coreference resolution sys- community and is referred to as the dynamic clus-
tem relies on a streaming clustering algorithm tering problem (Can and Ozkarahan, 1987).
and efficient calculation of similarity scores. We Others have attempted to use an incremen-
assume that we receive a stream of corefer- tal clustering approach, such as Gooi and Al-
ence chains, along with entity types, as they lan (2004) (who eventually prefer a hierarchi-
are extracted from documents. We use SERIF cal clustering approach), and Luo et al. (2004),
(Ramshaw and Weischedel, 2005), a state of the who use a Bell tree approach for incrementally
art document analysis system which performs clustering within document entity mentions. Our
intra-document coreference resolution. BBN de- work closely follows the Doubling Algorithm of
veloped SERIF to address information extraction Charikar et al. (1997), which has better perfor-
tasks in the ACE program and it is further de- mance guarantees for streaming data. Streaming
scribed in Pradhan et al. (2007). clustering means potentially linear performance in
Each unique entity is represented by an entity the number of observations since each document
cluster c, comprised of entity chains from many need only be examined a single time, as opposed
documents that refer to the same entity. Given to the quadratic cost of agglomerative clustering.3
an entity coreference chain e, we identify the best The Doubling Algorithm consists of two stages:
known entity cluster c. If a suitable entity cluster update and merge. Update adds points to existing
is not found, a new entity cluster is formed. clusters or creates new clusters while merge com-
An entity cluster is selected for a given corefer- bines clusters to prevent the clusters from exceed-
ence chain using several similarity scores, including a fixed limit. New clusters are created accord-
ing document context, predicted entity type, and ing to a threshold set using development data. We
orthographic similarity between the entity men- selected a threshold of 0.5 since it worked well in
tion and previously discovered references in the preliminary experiments. Since the number of en-
entity cluster. An efficient implementation of the tities grows with time, we have skipped the merge
similarity score allows the system to identify the step in our initial experiments so as not to limit
top k most likely mentions without considering all cluster growth.
m entity clusters. The final output of our sys- We use a dynamic caching scheme which backs
tem is a collection of entity clusters, each con- the actual clusters in a disk based index, but re-
taining a list of coreference chains and their doc- 3
It is possible to implement hierarchical agglomerative
uments. Additionally, due to its streaming nature, clustering in O(n log m) time where n is the number of
points and m in the number of clusters. However this is still
2
We preferred other data sets to the WePS data in our superlinear and expensive in situations where m continually
evaluation because it is not easily placed in temporal order. increases like in streaming coreference resolution.
grams generated for each cluster.
4.0
PER When given a new head mention e for a coref-
3.5 LOC erence chain, we generate all of the n-grams and
ORG
3.0 look up clusters that contain these n-grams using
log(Frequency)
2.5 the hash. We then compute the dice score:

2.0
|{ngram(e)} ∩ {ngram(c)}|
1.5 dice(e, c) = ,
|{ngram(e)} ∪ {ngram(c)}|
1.0
0.5 where {ngram(e)} are the set of n-grams in entity
0.0 mention e and {ngram(c)} are the set of n-grams
0.51 2 3 4 5 6
for all entity mentions in cluster c. Note that we
log(Rank) can calculate the numerator (the intersection) by
looking up the n-grams of e in the hash and count-
Figure 1: Frequency vs. rank for 567k people,
ing matches with c. The denominator is equivalent
136k organizations, and 25k locations in the New
to the number of n-grams unique to e and to c plus
York Times Annotated Corpus (Sandhaus, 2008).
the number that are shared. The number that are
shared is the intersection. The number unique to
tains basic cluster information in memory (see be- e is the total number of n-grams in e minus the in-
low). Doing so improves paging performance as tersection. The final term, the number unique to c,
observed in Omiecinski and Scheuermann (1984). is computed by taking the total number of n-grams
Motivated by the Zipfian distribution of named en- in c (a single integer stored in memory) minus the
tities in news sources (Figure 1), we organize our intersection.
cluster store using an LRU policy, which facili- Through this strategy, we can select only those
tates easy access to named entities that were ob- clusters that have the highest orthographic simi-
served in the recent past. We obtain additional larity to e without requiring the cluster contents,
performance gains by hashing the clusters based which may not be stored in memory. In our exper-
on the constituent mention string (details below). iments, we evaluate settings where we select all
This allows us to quickly retrieve a small but re- candidates with non-zero score and a pruned set of
lated number of clusters, k. It is always the case the top k dice score candidates. We also include
that k << m, the current number of clusters. in the n-gram list known aliases to facilitate or-
thographically dissimilar, but reasonable matches
3.2 Candidate Cluster Selection (e.g., IBM or ‘Big Blue’ for ‘International Busi-
ness Machines, Inc.’).4
As part of any clustering algorithm, each new item
must be compared against current clusters. As we For further efficiency, we keep separate caches
see more documents, the number of unique clus- for each named entity type.5 We then select the
ters (entities) grows. Therefore, we need efficient appropriate cache based on the automatically de-
methods to select candidate clusters. termined type of the named entity provided by the
To select the top candidate clusters, we obtain named entity tagger, which also prevents spurious
those that have high orthographic similarity with matches of non-matching entity types.
the head name mention in the coreference chain e.
3.3 Similarity Metric
We compute this similarity using the dice score on
either word unigrams or character skip bigrams. After filtering by orthographic information to
For each entity mention string associated with a quickly obtain a small set of candidate clusters,
cluster c, we generate all possible n-grams using a full similarity score is computed for the current
one of the above two policies. We then index the 4
We generated alias lists for entities from Freebase.
cluster by each of its n-grams in a hash maintained 5
Persons (PER), organizations (ORG), and locations
in memory. In addition, we keep the number of n- (LOC).
entity coreference chain and each retrieved can- Attribute smith nytac ace08 kbp09
Total Documents 197 1.85M 10k 1.2M
didate cluster. These computations require infor- Annotated Docs 197 19,360 415 **
mation about each cluster, so the cluster’s suffi- Annotated Entities 35 200 3,943 **
cient statistics are loaded using the LRU cache de-
Table 1: Data sets used in our experiments. For
scribed above.
the kbp09 data we did not have annotations.
We define several similarity metrics between
coreference chains and clusters to deal with both
name variation and disambiguation. For name ground truth entity clusters, was useful for com-
variation, we define an orthographic similarity puting other performance statistics. Properties for
metric to match similar entity mention strings. As each data set are given in Table 1.
before, we use word unigrams and character skip
bigrams. For each of these methods, we compute 4.1 John Smith corpus
a similarity score as dice(e, c) and select the high- Bagga and Baldwin (1998) evaluated their disam-
est scoring cluster. biguation system on a set of 197 articles from the
To address name disambiguation, we use two New York Times that mention a person named
types of context from the document. First, we use ’John Smith’. This data exhibits no name variants
lexical features represented as TF/IDF weighted and is strictly a disambiguation task. We include
vectors. Second, we consider topic features, in this data (smith) to allow comparison to previous
which each word in a document is replaced with work.
the topic inferred from a topic model. This yields
a distribution over topics for a given document. 4.2 NYTAC Pseudo-name corpus
We use an LDA (Blei et al., 2003) model trained To study the effects of word sense ambiguity
on the New York Times Annotated Corpus (Sand- and disambiguation several researchers have ar-
haus, 2008). We note that LDA can be computed tificially conflated dissimilar words together and
over streams (Yao et al., 2009). then attempted to disambiguate them (Sanderson,
To compare context vectors we use cosine sim- 2000). The obvious advantage is cheaply obtained
ilarity, where the cluster vector is the average of ground truth for disambiguation.
all document vectors assigned to the cluster. Note The same trick has also been employed in per-
that the filtering step in Section 3.2 returns only son name disambiguation (Mann and Yarowsky,
those candidates with some orthographic similar- 2003; Gooi and Allan, 2004). We adopt the same
ity with the coreference chain, so a similarity met- method on a somewhat larger scale using annota-
ric that uses context only is still restricted to ortho- tions from the New York Times Annotated Corpus
graphically similar entities. (NYTAC) (Sandhaus, 2008), which annotates doc-
Finally, we consider a combination of ortho- uments based on whether or not they mention an
graphic and context similarity as a linear combi- entity. The NYTAC data contains documents from
nation of the two metrics as: 20 years of the New York Times and contains rich
metadata and document-level annotations that in-
score(e, c) = α dice(e, c) + (1 − α)cosine(e, c) . dicate when an entity is mentioned in the docu-
ment using a standard lexicon of entities. (Note
We set α = 0.8 based on initial experiments.
that mention strings are not tagged.) Using these
annotations we created a set of 100 pairs of con-
4 Evaluation
flated person names.
We used several corpora to evaluate our meth- The names were selected to be medium fre-
ods, including two data sets commonly used in the quency (i.e., occurring in between 50 and 200 ar-
coreference community. We also created a new ticles) and each pair matches in gender. The first
test set using artificially conflated names. And fi- 50 pairs are for names that are topically similar,
nally to test scalability, we ran our algorithm over for example, Tim Robbins and Tom Hanks (both
a large text collection that, while it did not have actors); Barbara Boxer and Olympia Snowe (both
smith nytac ace08
Approach P R F P R F P R F
Baseline 1.000 0.178 0.302 1.000 0.010 0.020 1.000 0.569 0.725
ExactMatch 0.233 1.000 0.377 0.563 0.897 0.692 0.977 0.697 0.814
Ortho 0.603 0.629 0.616 0.611 0.784 0.687 0.975 0.694 0.811
BoW 0.956 0.367 0.530 0.930 0.249 0.349 0.989 0.589 0.738
Topic 0.847 0.592 0.697 0.815 0.244 0.363 0.983 0.605 0.750
Ortho+BoW 0.603 0.634 0.618 0.801 0.601 0.686 0.976 0.691 0.809
Ortho+Topic 0.603 0.634 0.618 0.800 0.591 0.680 0.975 0.704 0.819
Table 2: Best B 3 performance on the smith, nytac, and ace08 test sets.
US politicians). We imagined that this would be (e.g., “Osama Bin Laden” / “Usama Bin Laden”
a more challenging subset because of presumed or “Mark Twain” / “Samuel Clemens”) as well as
lexical overlap. The second set of 50 name pairs name disambiguation. Furthermore, the task re-
were arbitrarily conflated. We sub-selected the quired detection of when no appropriate KB entry
data to ensure that no two entities in our collec- exists, which is a departure from the conventional
tion co-occur in the same document and this left disambiguation problem. The collection contains
us with 19,360 documents for which ground-truth over 1.2 million documents, primarily newswire.
was known. In each document we rewrote the Wikipedia was used as a surrogate knowledge
conflated name mentions using a single gender- base, and it has been used in several previous stud-
neutral name; any middle initials or names were ies (e.g., Cucerzan (2007)). This task is closely re-
discarded. lated to CDCR, as mentions that are aligned to the
same knowledge base entry create a coreference
4.3 ACE 2008 corpus cluster. However, there are no actual CDCR anno-
The NIST ACE 2008 (ace08) evaluation studied tations for this corpus, though we used it nonethe-
several related technologies for information ex- les as a benchmark corpus to evaluate speed and
traction, including named-entity recognition, re- to demonstrate scalability.
lation extraction, and cross-document coreference
5 Discussion
for person names in both English and Arabic. Ap-
proximately 10,000 documents from several gen- 5.1 Accuracy
res (predominantly newswire) were given to par- In Table 2 we report cross document coreference
ticipants, who were expected to cluster person and resolution performance for a variety of experi-
organization entities across the entire collection. mental conditions using the B 3 method, which
However, only a selected set of about 400 docu- includes precision, recall, and calculated Fβ=1
ments were annotated and used to evaluate sys- values. For each of the three evaluation corpora
tem performance. Baron and Freedman (2008) (smith, nytac, and ace08) we report values for two
describe their work in this evaluation, which in- baseline methods and for similarity metrics us-
cluded a separate task for within-document coref- ing different types of features. The first baseline,
erence. called Baseline, places each coreference chain in
its own cluster while the second baseline, called
4.4 TAC-KBP 2009 corpus ExactMatch, merges all mentions that match ex-
The NIST TAC 2009 Knowledge Base Popula- actly orthographically into the same cluster.
tion track (kbp09) (McNamee and Dang, 2009) Use of name similarity scores as features (in ad-
conducted an evaluation of a system’s ability to dition to their use for candidate cluster selection)
link entity mentions to corresponding Wikipedia- is indicated by rows labeled Ortho. Use of lexi-
derived knowledge base nodes. The TAC-KBP cal features is indicated by BoW and use of topic
task focused on ambiguous person, organization, model features by Topic.
and geo-political entities mentioned in newswire, Using topic models as features was more help-
and required systems to cope with name variation ful than lexical contexts on the smith corpus.
90K 90000 27389 27358 64140 56581 27145 27126
100K 100000 29797 29782 71034 62228 29546 29511
110K 110000 32147 32139 77705 67853 31882 31840
120K 120000 34567 34589 84309 73397 34284 34235
130K 130000 36817 36874 90465 78676 36543 36486
140K 140000 38826 38901 96225 83525 38539 38482
1 5 10 20
150000 1000
2 0 0 1
2 0 0 1
2 0 0 1
# of clusters produced
2 0 112500
0 1
100
Time (secs)
2 2 0 4
2 2 0 4
2 2 75000
0 4
2 2 0 4
10
2 2 0 4
2 2 37500
0 4
2 2 0 4
2 2 0 5
1
2 2 00 6 1000 25000 49000 73000 97000 121000
2 2 0 1K 20K 6 40K 60K 80K 100K 120K 140K
# of chains processed
2 2 0 6 # of entity chains seen
1 5 10 20
2 2 0 6
2 2 0 Baseline
6 ExactMatch
2 2 0 Ortho 6 BOW Figure 3: Elapsed processing time as a function of
2 2 1 Topics 7 Ortho+BOW
2 2 1 7
Ortho+Topics bounding the number of candidate clusters consid-
2 2 2 7
2 2 2 7
ered for an entity. When fewer candidates are con-
2 2 3 7 sidered, clustering decisions can be made much
2
Figure 2: Number of clusters produced vs. num-
2 3 7
2 ber of entity chains observed in the stream. Num-
2 3 7 faster.
2 2 3 7
2 ber of entity chains is proportional to the number
2 4 8
2 2
of documents. 4 9
2 2 5 10 document stream. We experimented with using an
2 2 8 11
2 2 8 12
upper bound on the number of candidate clusters
2 2 8
When used alone topic beats BoW, but in com- 13 to consider for an entity.
2 2 8 14
2 bination with the ortho features performance is
2 8 15 Figure 4 compares the efficiency of using three
2 2 8 16 different methods for candidate cluster identifica-
2 equivalent. For both nytac and ace08 heavy re-
2 8 17
2 2 8
liance on orthographic similarity proved hard to18 tion. The most restrictive hashing strategy, using
2 2 9 19
2 beat. On the ace08 corpus Baron and Freedman
2 9 20
exact mention strings, is the most efficient, fol-
2 2 9 3 21
(2008) report B F-scores of 83.2 for persons and lowed by the use of words, then the use of charac-
2 2 10 22
2 67.8 for organizations, and our streaming results
2 10 23 ter skip bigrams. This makes intuitive sense – the
2 2 11 24 strictest matching reduces the number of candi-
2
appear to be comparable to their offline method.
2 12 25
2 2 The cluster growth induced by the various mea-
12 26 date clusters that have to be considered when pro-
2 2 13 27
2 sures can be seen in Figure 2. The two base-
3 14 28
cessing an entity.6
2 3 15
line methods, Baseline and ExactMatch, provide 29 The ace08 corpus contained over 10,000 doc-
2 4 16 30
2 bounds on the cluster growth with all other meth-
5 17 32
uments and is one of the largest CDCR test sets.
2 6
ods falling in between. 18 34 In Figure 5 we show how processing time grows
2 7 19 36
2 7 20 38 when processing the kbp09 corpus. Doubling the
2 5.2
7 Hashing Strategies for Candidate
21 40
number of entities processed increases the runtime
2 8 22 41
2 8 Selection
23 43 by about a factor of 5. The curve is not linear
2 9 24
345
Table 3 contains B F-scores when different hash- due to the increasing number of entity cluster’s
2 10 25 47
2 10 26
ing strategies are employed for candidate selec-48 that must be considered. Future work will exam-
2 11 27 50
2 tion. The trend appears to be that stricter match-
12 28 52 ine how to keep the number of clusters considered
2 13 29 54 constant over time, such as ignoring older entities.
2
ing outperforms fuzzier matching; full mentions
13 30 56
2 tended to beat words, which beat use of the char-
14 31 58
2 15 32
acter bigrams. This agrees with the results de-
60 6 Conclusion
3 16 33 62
3 17 34 63
scribed in the previous section, which show heavy
3 18 35 65
We have presented a new streaming cross doc-
3 reliance on orthographic similarity.
19 36 67 ument coreference resolution system. Our ap-
3 19 37 68
4 20 38 70 proach is substantially faster than previous sys-
4
5.3
21
Timing Results
39 72
6
4 22 40 73 In the limit, if names were unique, hashing on strings
5 Figure 3 shows how processing time increases
23 41 75 would completely solve the CDCR problem and processing
6 24 42 77
with the number of entities observed in the ace08 an entity would be O(1)
6 25 43 79
6 26 44 81
6 27 45 83
7 28 46 85
7 29 47 87
8 30 48 89
8 31 49 91
9 32 50 93
9 33 51 95
9 34 52 97
9 35 53 99
10 36 54 101
10 37 55 102
10 38 56 104
smith nytac ace08
Approach bigrams words mention bigrams words mention bigrams words mention
Ortho 0.382 0.553 0.616 0.120 0.695 0.687 0.540 0.797 0.811
BoW 0.480 0.530 0.467 0.344 0.339 0.349 0.551 0.700 0.738
Topic 0.697 0.661 0.579 0.071 0.620 0.363 0.544 0.685 0.750
Ortho+BoW 0.389 0.554 0.618 0.340 0.691 0.686 0.519 0.783 0.809
Ortho+Topic 0.398 0.555 0.618 0.120 0.477 0.680 0.520 0.776 0.819
Table 3: B 3 F-scores using different hashing strategies for candidate selection. Name/cluster similarity
could be based on character skip bigrams, words appear in names, or exact matching of mention.
bigram 10000
1000
1 1
2 5
2 6 1000
4 7
Time (secs)
100
Time (secs)
5 9
6 11 100
6 13
7 15
10 10
7 17
7 19
9 21
10 23 1
1 1K 100K 200K 400K 600K 900K 1.1M
11 25 1000 25000 49000 73000 97000 121000
13 27 # of coref chains processed
# of coref chains processed
14 29
15 31
mention string word bigram
16 33 Figure 5: The number of coreference chains pro-
17 36
18 39
cessed over time in the kbp09 corpus. The pro-
19 42 Figure 4: Comparison of three hashing strategies cessing of over 1 million coreference chains is at
21 44
23 47
for identifying candidate clusters for a given en- least an order of magnitude larger than previous
24 50 tity. The more restrictive strategies lead to faster systems reported.
26 53
28 56 processing as fewer candidates are considered.
30 60
32 64
34 68 pages 79–85, Morristown, NJ, USA. Association for
36 71 Computational Linguistics.
39 75
tems, and our experiments have demonstrated
40 79 scalability to an order of magnitude larger data
43 83 Baron, Alex and Marjorie Freedman. 2008. Who
45 88
than previously published evaluations. Despite its is Who and What is What: Experiments in cross-
47 92 speed and simplicity, we still obtain competitive document co-reference. In Proceedings of the
49 96
51 100 results on a variety of data sets as compared with 2008 Conference on Empirical Methods in Nat-
53 104
batch systems. In future work, we plan to investi- ural Language Processing, pages 274–283, Hon-
55 108 olulu, Hawaii, October. Association for Computa-
57 112 gate additional similarity metrics that can be com- tional Linguistics.
59 117
63 121
puted efficiently, as well as experiments on web
66 125 scale corpora. Blei, D.M., A.Y. Ng, and M.I. Jordan. 2003. Latent
70 129 dirichlet allocation. Journal of Machine Learning
73 133
77 137
Research (JMLR), 3:993–1022.
80 142
84 146 References Can, F. and E. Ozkarahan. 1987. A dynamic clus-
87 150
ter maintenance system for information retrieval. In
90 154 Artiles, Javier, Satoshi Sekine, and Julio Gonzalo.
94 158 SIGIR ’87: Proceedings of the 10th annual interna-
98 162
2008. Web people search: results of the first eval- tional ACM SIGIR conference on Research and de-
01 167 uation and the plan for the second. In WWW-2008, velopment in information retrieval, pages 123–131,
04 171 pages 1071–1072.
07 175
New York, NY, USA. ACM.
11 179
15 183 Bagga, Amit and Breck Baldwin. 1998. Entity- Charikar, Moses, Chandra Chekuri, Tomás Feder, and
19 187
based cross-document coreferencing using the vec- Rajeev Motwani. 1997. Incremental clustering and
22 192
26 196 tor space model. In Proceedings of the 17th Inter- dynamic information retrieval. In STOC ’97: Pro-
30 200 national Conference on Computational linguistics, ceedings of the twenty-ninth annual ACM sympo-
34 204
39 208
43 212
46 217
50 222
55 227
58 232
63 237
67 242
70 247
73 252
78 257
83 262
sium on Theory of computing, pages 626–635, New Pradhan, S.S., L. Ramshaw, R. Weischedel,
York, NY, USA. ACM. J. MacBride, and L. Micciulla. 2007. Unre-
stricted coreference: Identifying entities and events
Chen, Ying and James Martin. 2007. Towards ro- in ontonotes. In International Conference on
bust unsupervised personal name disambiguation. Semantic Computing (ICSC).
In Proceedings of the 2007 Joint Conference on
Empirical Methods in Natural Language Process- Ramshaw, L. and R. Weischedel. 2005. Information
ing and Computational Natural Language Learning extraction. In IEEE ICASSP.
(EMNLP-CoNLL), pages 190–198.
Sanderson, Mark. 2000. Retrieving with good sense.
Choi, Y. and C. Cardie. 2007. Structured local training Information Retrieval, 2(1):45–65.
and biased potential functions for conditional ran-
dom fields with application to coreference resolu- Sandhaus, Evan. 2008. The new york times annotated
tion. In Proceedings of NAACL HLT, pages 65–72. corpus. Linguistic Data Consortium, Philadelphia.
Cucerzan, Silviu. 2007. Large-scale named entity dis- Soon, Wee Meng, Hwee Tou Ng, and Daniel
ambiguation based on wikipedia data. In EMNLP, Chung Yong Lim. 2001. A machine learning ap-
pages 708–716. proach to coreference resolution of noun phrases.
Computational Linguistics.
Firth, J.R. 1957. A synopsis of linguistic theory 1930-
1955. In Studies in Linguistic Analysis, pages 1–32. Yao, L., D. Mimno, and A. McCallum. 2009. Efficient
Oxford: Philological Society. methods for topic model inference on streaming
document collections. In Proceedings of the 15th
Gooi, Chung Heong and James Allan. 2004. Cross- ACM SIGKDD international conference on Knowl-
document coreference on a large scale corpus. In edge discovery and data mining, pages 937–946.
Susan Dumais, Daniel Marcu and Salim Roukos, ACM New York, NY, USA.
editors, HLT-NAACL 2004: Main Proceedings,
pages 9–16, Boston, Massachusetts, USA, May 2 -
May 7. Association for Computational Linguistics.
Luo, X., A. Ittycheriah, H. Jing, N. Kambhatla, and
S. Roukos. 2004. A mention-synchronous corefer-
ence resolution algorithm based on the bell tree. In
Proc. of the ACL, pages 136–143.
Mann, Gideon S. and David Yarowsky. 2003. Unsu-
pervised personal name disambiguation. In Daele-
mans, Walter and Miles Osborne, editors, Pro-
ceedings of CoNLL-2003, pages 33–40. Edmonton,
Canada.
McNamee, Paul and Hoa Dang. 2009. Overview of
the TAC 2009 knowledge base population track. In
Text Analysis Conference (TAC).
Ng, V. and C. Cardie. 2002. Improving machine learn-
ing approaches to coreference resolution. In Pro-
ceedings of the 40th Annual Meeting of the Associa-
tion for Computational Linguistics, pages 104–111.
Nicolae, C. and G. Nicolae. 2006. Bestcut: A graph
algorithm for coreference resolution. In Proceed-
ings of the 2006 Conference on Empirical Methods
in Natural Language Processing, pages 275–283.
Association for Computational Linguistics.
Omiecinski, Edward and Peter Scheuermann. 1984. A
global approach to record clustering and file reorga-
nization. In Proc. of the third joint BCS and ACM
symposium on Research and development in infor-
mation retrieval, pages 201–219, New York, NY,
USA. Cambridge University Press.

Streaming Cross Document Entity Coreference Resolution

Hochgeladen von

Dokumentinformationen

Originalbeschreibung:

Copyright

Verfügbare Formate

Dieses Dokument teilen

Dokument teilen oder einbetten

Freigabeoptionen

Stufen Sie dieses Dokument als nützlich ein?

Sind diese Inhalte unangemessen?

Copyright:

Verfügbare Formate

Streaming Cross Document Entity Coreference Resolution

Hochgeladen von

Copyright:

Verfügbare Formate

Streaming Cross Document Entity Coreference Resolution

Delip Rao and Paul McNamee and Mark Dredze

2.5 the hash. We then compute the dice score:

Das könnte Ihnen auch gefallen