[Table 2: Best B³ performance on the smith, nytac, and ace08 test sets.]
US politicians). We imagined that this would be a more challenging subset because of presumed lexical overlap. The second set of 50 name pairs were arbitrarily conflated. We sub-selected the data to ensure that no two entities in our collection co-occur in the same document, which left us with 19,360 documents for which ground truth was known. In each document we rewrote the conflated name mentions using a single gender-neutral name; any middle initials or names were discarded.
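As an illustration of this rewriting step, here is a minimal sketch of how such a conflation pass could be implemented. The name pair, the replacement name, and the regular expression for absorbing middle initials are all invented for the example; they are not details taken from the actual corpus construction.

```python
import re

# Hypothetical conflated pair and gender-neutral replacement; the
# actual 50 pairs and replacement names are not reproduced here.
CONFLATIONS = {
    ("John Doe", "Jane Roe"): "Pat Smith",
}

def conflate(text: str) -> str:
    """Rewrite every mention of either name in a conflated pair as a
    single gender-neutral name, discarding middle initials/names."""
    for (name_a, name_b), neutral in CONFLATIONS.items():
        for name in (name_a, name_b):
            first, last = name.split()[0], name.split()[-1]
            # Optionally absorb one middle initial or middle name.
            pattern = rf"\b{re.escape(first)}(?:\s+\w+\.?)?\s+{re.escape(last)}\b"
            text = re.sub(pattern, neutral, text)
    return text

print(conflate("John A. Doe met Jane Roe."))  # -> Pat Smith met Pat Smith.
```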
4.3 ACE 2008 corpus

The NIST ACE 2008 (ace08) evaluation studied several related technologies for information extraction, including named-entity recognition, relation extraction, and cross-document coreference for person names in both English and Arabic. Approximately 10,000 documents from several genres (predominantly newswire) were given to participants, who were expected to cluster person and organization entities across the entire collection. However, only a selected set of about 400 documents was annotated and used to evaluate system performance. Baron and Freedman (2008) describe their work in this evaluation, which included a separate task for within-document coreference.

4.4 TAC-KBP 2009 corpus

The NIST TAC 2009 Knowledge Base Population track (kbp09) (McNamee and Dang, 2009) conducted an evaluation of a system's ability to link entity mentions to corresponding Wikipedia-derived knowledge base nodes. The TAC-KBP task focused on ambiguous person, organization, and geo-political entities mentioned in newswire, and required systems to cope with name variation (e.g., “Osama Bin Laden” / “Usama Bin Laden” or “Mark Twain” / “Samuel Clemens”) as well as name disambiguation. Furthermore, the task required detection of when no appropriate KB entry exists, which is a departure from the conventional disambiguation problem. The collection contains over 1.2 million documents, primarily newswire. Wikipedia was used as a surrogate knowledge base, and it has been used in several previous studies (e.g., Cucerzan (2007)). This task is closely related to CDCR, as mentions that are aligned to the same knowledge base entry create a coreference cluster. However, there are no actual CDCR annotations for this corpus, though we used it nonetheless as a benchmark corpus to evaluate speed and to demonstrate scalability.

5 Discussion

5.1 Accuracy

In Table 2 we report cross-document coreference resolution performance for a variety of experimental conditions using the B³ method, reporting precision, recall, and Fβ=1 values.
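For reference, the standard B³ definitions are as follows, writing C(m) for the system cluster containing mention m, G(m) for its gold cluster, and N for the total number of mentions:

```latex
P_{B^3} = \frac{1}{N}\sum_{m}\frac{|C(m)\cap G(m)|}{|C(m)|}, \qquad
R_{B^3} = \frac{1}{N}\sum_{m}\frac{|C(m)\cap G(m)|}{|G(m)|}, \qquad
F_{\beta=1} = \frac{2\,P_{B^3}\,R_{B^3}}{P_{B^3}+R_{B^3}}
```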
For each of the three evaluation corpora (smith, nytac, and ace08) we report values for two baseline methods and for similarity metrics using different types of features. The first baseline, called Baseline, places each coreference chain in its own cluster, while the second baseline, called ExactMatch, merges all mentions that match exactly orthographically into the same cluster. Use of name similarity scores as features (in addition to their use for candidate cluster selection) is indicated by rows labeled Ortho. Use of lexical features is indicated by BoW and use of topic model features by Topic.

Using topic models as features was more helpful than lexical contexts on the smith corpus. When used alone Topic beats BoW, but in combination with the Ortho features performance is equivalent. For both nytac and ace08, heavy reliance on orthographic similarity proved hard to beat. On the ace08 corpus, Baron and Freedman (2008) report B³ F-scores of 83.2 for persons and 67.8 for organizations, and our streaming results appear to be comparable to their offline method.

The cluster growth induced by the various measures can be seen in Figure 2. The two baseline methods, Baseline and ExactMatch, provide bounds on the cluster growth, with all other methods falling in between.

[Figure 2: Number of clusters produced vs. number of entity chains observed in the stream. Number of entity chains is proportional to the number of documents. Series: Baseline, ExactMatch, Ortho, BOW, Topics, Ortho+BOW, Ortho+Topics.]

5.2 Hashing Strategies for Candidate Selection

Table 3 contains B³ F-scores when different hashing strategies are employed for candidate selection. The trend appears to be that stricter matching outperforms fuzzier matching; full mentions tended to beat words, which beat use of character bigrams. This agrees with the results described in the previous section, which show heavy reliance on orthographic similarity.
Approach          smith                     nytac                     ace08
             bigrams words  mention    bigrams words  mention    bigrams words  mention
Ortho         0.382  0.553   0.616      0.120  0.695   0.687      0.540  0.797   0.811
BoW           0.480  0.530   0.467      0.344  0.339   0.349      0.551  0.700   0.738
Topic         0.697  0.661   0.579      0.071  0.620   0.363      0.544  0.685   0.750
Ortho+BoW     0.389  0.554   0.618      0.340  0.691   0.686      0.519  0.783   0.809
Ortho+Topic   0.398  0.555   0.618      0.120  0.477   0.680      0.520  0.776   0.819

Table 3: B³ F-scores using different hashing strategies for candidate selection. Name/cluster similarity could be based on character skip bigrams, words appearing in names, or exact matching of mentions.
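To make these keying strategies concrete, here is a minimal sketch of hash-based candidate selection. It is our illustration under stated assumptions, not the paper's implementation: the function and variable names are invented, and "skip bigrams" is rendered as all ordered character pairs, one common definition.

```python
from collections import defaultdict
from itertools import combinations

def mention_keys(mention: str, strategy: str) -> set:
    """Hash keys under which a mention is indexed. Stricter strategies
    yield keys shared by fewer mentions (mention > words > bigrams)."""
    name = mention.lower()
    if strategy == "mention":    # exact mention string
        return {name}
    if strategy == "words":      # individual name tokens
        return set(name.split())
    if strategy == "bigrams":    # character skip bigrams (ordered char pairs)
        chars = name.replace(" ", "")
        return {a + b for a, b in combinations(chars, 2)}
    raise ValueError(f"unknown strategy: {strategy}")

# Maps each key to the ids of clusters containing a mention with that key.
index = defaultdict(set)

def add_to_index(mention: str, cluster_id: int, strategy: str) -> None:
    for key in mention_keys(mention, strategy):
        index[key].add(cluster_id)

def candidates(mention: str, strategy: str) -> set:
    """Clusters sharing at least one key with the mention; only these
    are scored with the full similarity function."""
    cands = set()
    for key in mention_keys(mention, strategy):
        cands |= index.get(key, set())
    return cands
```

Stricter keys place fewer mentions in each bucket, so candidates() returns fewer clusters to score, which is consistent with the timing results below.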
5.3 Timing Results

Figure 3 shows how processing time increases with the number of entities observed in the ace08 document stream. We experimented with using an upper bound on the number of candidate clusters to consider for an entity.

[Figure 3: Elapsed processing time (secs) as a function of bounding the number of candidate clusters considered for an entity (bounds of 1, 5, 10, and 20), plotted against the number of chains processed. When fewer candidates are considered, clustering decisions can be made much faster.]
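One way to realize such a bound, reusing candidates() from the sketch above, is shown below; ranking candidates by recency before truncation is our assumption for illustration, as the ordering heuristic is not specified here.

```python
def bounded_candidates(mention: str, strategy: str,
                       bound: int, last_update: dict) -> list:
    """Keep at most `bound` candidate clusters (e.g., 1, 5, 10, or 20)
    before running full similarity scoring. Preferring recently updated
    clusters is an illustrative choice, not a detail from the paper."""
    cands = candidates(mention, strategy)
    ranked = sorted(cands, key=lambda cid: last_update.get(cid, 0),
                    reverse=True)
    return ranked[:bound]
```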
Figure 4 compares the efficiency of using three different methods for candidate cluster identification. The most restrictive hashing strategy, using exact mention strings, is the most efficient, followed by the use of words, then the use of character skip bigrams. This makes intuitive sense: the strictest matching reduces the number of candidate clusters that have to be considered when processing an entity.[6]

[Figure 4: Comparison of three hashing strategies (mention string, word, bigram) for identifying candidate clusters for a given entity. The more restrictive strategies lead to faster processing as fewer candidates are considered.]

The ace08 corpus contained over 10,000 documents and is one of the largest CDCR test sets. In Figure 5 we show how processing time grows when processing the kbp09 corpus. Doubling the number of entities processed increases the runtime by about a factor of 5. The curve is not linear due to the increasing number of entity clusters that must be considered. Future work will examine how to keep the number of clusters considered constant over time, such as by ignoring older entities.

[Figure 5: Elapsed time (secs) vs. the number of coreference chains processed in the kbp09 corpus. The processing of over 1 million coreference chains is at least an order of magnitude larger than previously reported systems.]

[6] In the limit, if names were unique, hashing on strings would completely solve the CDCR problem and processing an entity would be O(1).
6 Conclusion

We have presented a new streaming cross-document coreference resolution system. Our approach is substantially faster than previous systems, and our experiments have demonstrated scalability to data an order of magnitude larger than previously published evaluations. Despite its speed and simplicity, we still obtain competitive results on a variety of data sets as compared with batch systems. In future work, we plan to investigate additional similarity metrics that can be computed efficiently, as well as experiments on web-scale corpora.

References

Artiles, Javier, Satoshi Sekine, and Julio Gonzalo. 2008. Web people search: results of the first evaluation and the plan for the second. In WWW-2008, pages 1071–1072.

Bagga, Amit and Breck Baldwin. 1998. Entity-based cross-document coreferencing using the vector space model. In Proceedings of the 17th International Conference on Computational Linguistics, pages 79–85, Morristown, NJ, USA. Association for Computational Linguistics.

Baron, Alex and Marjorie Freedman. 2008. Who is Who and What is What: Experiments in cross-document co-reference. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 274–283, Honolulu, Hawaii, October. Association for Computational Linguistics.

Blei, D.M., A.Y. Ng, and M.I. Jordan. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research (JMLR), 3:993–1022.

Can, F. and E. Ozkarahan. 1987. A dynamic cluster maintenance system for information retrieval. In SIGIR '87: Proceedings of the 10th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 123–131, New York, NY, USA. ACM.

Charikar, Moses, Chandra Chekuri, Tomás Feder, and Rajeev Motwani. 1997. Incremental clustering and dynamic information retrieval. In STOC '97: Proceedings of the Twenty-Ninth Annual ACM Symposium on Theory of Computing, pages 626–635, New York, NY, USA. ACM.
Chen, Ying and James Martin. 2007. Towards robust unsupervised personal name disambiguation. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 190–198.

Choi, Y. and C. Cardie. 2007. Structured local training and biased potential functions for conditional random fields with application to coreference resolution. In Proceedings of NAACL HLT, pages 65–72.

Cucerzan, Silviu. 2007. Large-scale named entity disambiguation based on Wikipedia data. In EMNLP, pages 708–716.

Firth, J.R. 1957. A synopsis of linguistic theory 1930–1955. In Studies in Linguistic Analysis, pages 1–32. Oxford: Philological Society.

Gooi, Chung Heong and James Allan. 2004. Cross-document coreference on a large scale corpus. In Susan Dumais, Daniel Marcu, and Salim Roukos, editors, HLT-NAACL 2004: Main Proceedings, pages 9–16, Boston, Massachusetts, USA, May 2 - May 7. Association for Computational Linguistics.

Luo, X., A. Ittycheriah, H. Jing, N. Kambhatla, and S. Roukos. 2004. A mention-synchronous coreference resolution algorithm based on the Bell tree. In Proc. of the ACL, pages 136–143.

Mann, Gideon S. and David Yarowsky. 2003. Unsupervised personal name disambiguation. In Daelemans, Walter and Miles Osborne, editors, Proceedings of CoNLL-2003, pages 33–40. Edmonton, Canada.

McNamee, Paul and Hoa Dang. 2009. Overview of the TAC 2009 knowledge base population track. In Text Analysis Conference (TAC).

Ng, V. and C. Cardie. 2002. Improving machine learning approaches to coreference resolution. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 104–111.

Nicolae, C. and G. Nicolae. 2006. BestCut: A graph algorithm for coreference resolution. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, pages 275–283. Association for Computational Linguistics.

Omiecinski, Edward and Peter Scheuermann. 1984. A global approach to record clustering and file reorganization. In Proc. of the Third Joint BCS and ACM Symposium on Research and Development in Information Retrieval, pages 201–219, New York, NY, USA. Cambridge University Press.

Pradhan, S.S., L. Ramshaw, R. Weischedel, J. MacBride, and L. Micciulla. 2007. Unrestricted coreference: Identifying entities and events in OntoNotes. In International Conference on Semantic Computing (ICSC).

Ramshaw, L. and R. Weischedel. 2005. Information extraction. In IEEE ICASSP.

Sanderson, Mark. 2000. Retrieving with good sense. Information Retrieval, 2(1):45–65.

Sandhaus, Evan. 2008. The New York Times Annotated Corpus. Linguistic Data Consortium, Philadelphia.

Soon, Wee Meng, Hwee Tou Ng, and Daniel Chung Yong Lim. 2001. A machine learning approach to coreference resolution of noun phrases. Computational Linguistics.

Yao, L., D. Mimno, and A. McCallum. 2009. Efficient methods for topic model inference on streaming document collections. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 937–946. ACM, New York, NY, USA.