
ACL 2019 Submission 1920. Confidential Review Copy. DO NOT DISTRIBUTE.

Learning to Bootstrap Entity Set Expansion via Deep Similarity Network and Monte Carlo Tree Search

Anonymous ACL submission
Abstract

Bootstrapping for Entity Set Expansion (ESE) aims at iteratively acquiring new instances of a specific target category. However, traditional bootstrapping ESE often suffers from two problems: 1) delayed feedback, where the pattern evaluation relies on both a pattern's direct extraction quality and its extraction quality in later iterations; 2) sparse supervision, where the supervision signals are often only a few seed entities, which makes entity scoring difficult. To address these two problems, we propose a deep similarity network-based bootstrapping method combined with the Monte Carlo Tree Search (MCTS) algorithm, which can efficiently estimate delayed feedback for pattern evaluation and adaptively score entities given sparse supervision signals. Experimental results verify the effectiveness of our method.

[Figure 1: Example of two patterns "* is a big city" and "the US Embassy in *" used for expanding capital seeds, where the two patterns have the same instant feedback but different delayed feedback. Due to the page limitation, we show only the top entities extracted by each pattern (true ones shown in bold); P is the precision of the extracted entities. For the seeds {London, Beijing, Paris}, "* is a big city" extracts {Rome, Tokyo, Berlin, Sydney, Moscow} (P=0.8, 4/5) and "the US Embassy in *" extracts {Baghdad, Moscow, Nairobi, Seoul, France} (P=0.8, 4/5); in later iterations, "the meeting held in *" leads to "will be held in *", which extracts {Room, June, May, September, Washington} (P=0.2, 1/5), whereas "the Japanese Embassy in *" leads to "* is the capital of", which extracts {Amsterdam, Prague, Ottawa, Brussels, Florence} (P=0.8, 4/5).]
1 Introduction
Bootstrapping is a classical technique for Entity Set Expansion (ESE), which acquires new instances of a specific category by iteratively evaluating and selecting patterns, then extracting and scoring entities. For example, given the seeds {London, Paris, Beijing} for capital entity expansion, a bootstrapping ESE system iteratively selects effective patterns, e.g., "the US Embassy in *", and extracts other capital entities, e.g., Moscow.

The main challenges of effective bootstrapping for ESE stem from delayed feedback and sparse supervision. Firstly, bootstrapping is an iterative process in which noisy inclusions by currently selected patterns can affect successive iterations (Movshovitz-Attias and Cohen, 2012; Qadir et al., 2015); consequently, pattern evaluation relies not only on a pattern's direct extraction quality but also on the extraction quality in later iterations, which we call instant feedback and delayed feedback respectively in this paper. For instance, as shown in Figure 1, although "* is a big city" and "the US Embassy in *" have the same direct extraction quality, the former is still worse than the latter since its later extracted entities are mostly unrelated. Selecting patterns with high instant feedback but low delayed feedback can cause the semantic drift problem (Curran et al., 2007), where extracted entities in later iterations belong to other categories. Secondly, this difficulty is further compounded by sparse supervision, i.e., using only seed entities as supervision, since it provides little evidence to decide whether an extracted entity belongs to the target category or not.

Currently, most previous studies evaluate patterns mainly based on their direct extraction features, e.g., the matching statistics with known entities (Riloff and Jones, 1999). To avoid semantic drift, most of them exploit extra constraints, such as parallel multiple categories (Thelen and Riloff, 2002; Yangarber, 2003), negative samples (Yangarber et al., 2002; McIntosh, 2010; Shi et al., 2014), and mutual exclusion bootstrapping (Curran et al., 2007; McIntosh and Curran, 2008). These constraints can also be regarded as undirected delayed feedback, since patterns violating the constraints often have low delayed feedback. To address the sparse supervision challenge, most previous studies score entities by leveraging statistical features within the bootstrapping process (Riloff and Jones, 1999; Stevenson and Greenwood, 2005; Pantel and Pennacchiotti, 2006; McIntosh and Curran, 2008; Pantel et al., 2009), which often fails since sparse statistical features provide little semantic information for evaluating entities. Some recent studies use fixed word embeddings as external resources and evaluate entities by their similarity to seeds (Batista et al., 2015; Gupta and Manning, 2015); however, fixed embeddings cannot precisely represent the underlying semantics of seed entities in the given corpus, since they are often pre-trained on other corpora. Recently, Berger et al. (2018) learn custom embeddings through the bootstrapping process, but the learning cost is still high.

In this paper, we propose a deep similarity network-based bootstrapping method, which can adaptively evaluate patterns and score entities by incorporating the Monte Carlo Tree Search (MCTS) algorithm. Specifically, our method solves the delayed feedback problem by enhancing the traditional bootstrapping method with the MCTS algorithm, which effectively estimates each pattern's delayed feedback via efficient multi-step lookahead search. In this way, our method can select patterns based on their delayed feedback rather than their instant feedback, which is more reliable and accurate for bootstrapping. To resolve the sparse supervision problem, we propose a deep similarity network, the pattern mover similarity network (PMSN), which uniformly embeds entities and patterns as an adaptive distribution over context pattern embeddings, and measures their semantic similarity to seeds using a pattern mover similarity measurement. We combine the PMSN with the MCTS, and fine-tune the distribution using the estimated delayed feedback. In this way, our method can adaptively embed and score entities in the corpus.

The contributions of our work are:
1) We enhance traditional bootstrapping via the MCTS algorithm to estimate delayed feedback in bootstrapping. To the best of our knowledge, this is the first time bootstrapping has been combined with the MCTS algorithm.
2) We propose a novel deep similarity network, which can adaptively evaluate different categories of entities in Entity Set Expansion.

This paper is organized as follows: In Section 2, we describe how to enhance bootstrapping via the MCTS algorithm. In Section 3, we describe our pattern mover similarity network in detail. Section 4 describes the experiments. Section 5 briefly reviews related work and Section 6 concludes this paper.

2 Enhancing Bootstrapping via Monte Carlo Tree Search

In this section, we describe how to enhance traditional bootstrapping ESE using the MCTS algorithm. With the MCTS algorithm, our method can estimate the delayed feedback of each pattern through multi-step lookahead search and select the patterns with the highest delayed feedback.

2.1 Traditional Bootstrapping

A traditional bootstrapping ESE system is usually provided with sparse supervision, i.e., a few seed entities, and extracts new entities from the corpus by iteratively performing the following steps (see Figure 2a):

Pattern generation. Given the seed entities and the entities extracted so far (the known entities), a bootstrapping ESE system first generates patterns from the corpus. In this paper, we use lexico-syntactic surface words around known entities as the patterns.

Pattern evaluation. This step evaluates the generated patterns using the sparse supervision and other evidence. Many previous studies (Riloff and Jones, 1999; Thelen and Riloff, 2002; Curran et al., 2007; Gupta and Manning, 2014) use the RlogF function or its variants to evaluate patterns.

Entity expansion. This step selects the top patterns to match new candidate entities from the corpus.

Entity scoring. This step scores the candidate entities using the sparse supervision, bootstrapping evidence, or other external resources; the top entities are then added to the extracted entity set.

From the above steps, we can see that it is difficult for traditional methods to estimate the delayed feedback in the pattern evaluation step, since the delayed feedback of a pattern cannot be estimated unless the pattern has already been selected.
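The iterative loop above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `generate_patterns`, `score_pattern`, `match_entities`, and `score_entity` are hypothetical placeholders for the corpus-specific components (e.g., RlogF-based pattern scoring).

```python
def bootstrap_ese(seeds, generate_patterns, score_pattern, match_entities,
                  score_entity, n_iters=10, top_k_patterns=5, top_k_entities=10):
    """Sketch of the traditional bootstrapping loop of Section 2.1."""
    known = set(seeds)
    for _ in range(n_iters):
        # Pattern generation: contexts around the currently known entities.
        patterns = generate_patterns(known)
        # Pattern evaluation: rank patterns, e.g., by an RlogF-style score.
        ranked = sorted(patterns,
                        key=lambda p: score_pattern(p, known), reverse=True)
        # Entity expansion: top patterns match new candidate entities.
        candidates = set()
        for p in ranked[:top_k_patterns]:
            candidates |= set(match_entities(p)) - known
        # Entity scoring: add the best-scoring candidates to the known set.
        best = sorted(candidates,
                      key=lambda e: score_entity(e, known), reverse=True)
        known |= set(best[:top_k_entities])
    return known
```

Note that the pattern ranking here uses only instant feedback; the MCTS enhancement described next replaces exactly this ranking step.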


[Figure 2: Traditional bootstrapping for Entity Set Expansion (a) and enhancing pattern evaluation using the MCTS algorithm (b). (a) A traditional bootstrapping system for entity set expansion; the main difference between our method and traditional methods is that we enhance the pattern evaluation with the Monte Carlo Tree Search shown in (b). (b) The Monte Carlo Tree Search (Selection, Expansion, Evaluation, Backup) for the pattern evaluation in a bootstrapping system; the circles refer to the search states (seed entities plus extracted entities), and an edge from a left circle to a right circle represents selecting a pattern to match new entities from the left state, which results in a new state.]

Therefore, to efficiently estimate the delayed feedback, we enhance the pattern evaluation step using a classical lookahead search algorithm, Monte Carlo Tree Search, which estimates the delayed feedback of each pattern by multi-step lookahead search. In this way, our method can directly select the patterns with the highest delayed feedback.

2.2 Enhancing Pattern Evaluation via MCTS

In this subsection, we describe how to estimate the delayed feedback of each pattern using the MCTS algorithm.

Formally, given the seed entity set Es and the entity set E extracted so far, our method builds a search tree by first constructing a root state s0 = Es ∪ E. Starting from the root state, our method performs several simulations to estimate the delayed feedback of patterns. Each simulation contains several steps, and each step selects one pattern to match new entities; once we obtain new entities after one step, we add them to the extracted entity set to get a new entity set E', and build a new state s' = Es ∪ E'. Besides, we build an edge (s, p) linking state s to state s', which means that selecting pattern p from state s results in state s'. To estimate the final delayed feedback, each edge stores an action value Q(s, p) and a visit count N(s, p) during tree search. Specifically, each MCTS simulation in our method performs the following four steps (see Figure 2b):

Selection. Starting from the root state s0, our method performs one simulation by traversing the MCTS search tree until it reaches a leaf state (one never reached before) or a fixed depth. Specifically, at each step i (i > 0) of this simulation, we select a new action pi from state s by:

  pi = argmax_p [ Q(s, p) + µ(s, p) ]    (1)

  µ(s, p) ∝ pσ(s, p) / (1 + N(s, p))    (2)

where pσ(s, p) is the prior probability of p returned by the policy network pσ, which will be described in detail below.

Expansion. When the traversal reaches a leaf state at step L, the current state sL is expanded: a legal action is selected and new entities are expanded to build a new state. We select the action based on each pattern's prior probability returned by the policy network pσ.

Evaluation. Once the leaf state has been expanded into a new state, or the search process reaches a certain depth, our method quickly performs several further steps using the RlogF function (which replaces the policy network for quick pattern selection); we then evaluate the quality of the finally extracted new entities and return it as the reward R of this simulation.

Backup. At the end of each simulation, we use the reward to update the action values and visit counts of all <state, pattern> pairs as:

  N(s, p) = Σ_{j=1}^{n} 1(s, p, j)    (3)

  Q(s, p) = (1 / N(s, p)) Σ_{j=1}^{n} 1(s, p, j) · R_j    (4)

where 1(s, p, j) indicates whether the edge (s, p) was traversed during the j-th simulation, and R_j is the reward of the j-th simulation.
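The four steps above can be sketched as follows. This is an illustrative sketch, not the paper's implementation: `apply_pattern(s, p)` stands in for selecting pattern p in state s and building the resulting state, `prior(s, p)` for the PMSN policy pσ, and `reward(s)` for the reward function of Eq. (6); all four argument names are hypothetical.

```python
from collections import defaultdict

def run_simulations(root_state, legal_patterns, apply_pattern, prior, reward,
                    n_sims=50, max_depth=3):
    """Sketch of the MCTS pattern evaluation of Section 2.2 (Eqs. 1-4)."""
    Q = defaultdict(float)   # action values Q(s, p)
    N = defaultdict(int)     # visit counts N(s, p)
    tree = {root_state}      # states already expanded into the search tree

    for _ in range(n_sims):
        state, path = root_state, []
        # Selection (Eqs. 1-2): descend while the state is in the tree,
        # maximizing Q(s, p) + prior(s, p) / (1 + N(s, p)).
        while state in tree and len(path) < max_depth:
            p = max(legal_patterns(state),
                    key=lambda p: Q[(state, p)]
                    + prior(state, p) / (1 + N[(state, p)]))
            path.append((state, p))
            state = apply_pattern(state, p)
        # Expansion: add the newly reached leaf state to the tree.
        tree.add(state)
        # Evaluation: score the entities extracted during this simulation.
        R = reward(state)
        # Backup (Eqs. 3-4): incremental update of the running averages.
        for (s, p) in path:
            N[(s, p)] += 1
            Q[(s, p)] += (R - Q[(s, p)]) / N[(s, p)]
    return Q, N
```

After all simulations, the root-level action values Q(s0, p) serve as the estimated delayed feedback of each candidate pattern.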


After finishing all MCTS simulations, the action values of all <state, pattern> pairs (s0, p) indicate the estimated delayed feedback of each pattern at the current iteration. Therefore, if the reward function can precisely evaluate the extraction quality, the action values can be precisely estimated by the MCTS algorithm. Our method then selects the top patterns with the highest action values, and is thus more likely to extract reliable new entities.

Prior Policy using the Pattern Mover Similarity Network

A promising prior policy network is critical in MCTS simulations, since there are many candidate patterns at each step, which makes the final search space extremely large. To prune bad patterns, we use the pattern mover similarity network as the prior policy network.

To give a prior evaluation of patterns, one way is to exploit statistical feature-based functions, e.g., the RlogF function, like many traditional methods (Riloff and Jones, 1999; Pantel and Pennacchiotti, 2006; McIntosh and Curran, 2008; Gupta and Manning, 2014). However, these functions are mainly designed for direct bootstrapping rather than for MCTS simulations; in addition, since only sparse supervision is provided, statistical features are often unreliable and depend heavily on the seed selection. Another way is to exploit embedding-based models, as in some recent studies (Batista et al., 2015; Berger et al., 2018). Embedding-based models can leverage external resources, i.e., pre-trained word embeddings, to overcome the sparse supervision problem.

In this paper, we also exploit an embedding-based model as our policy network. Particularly, we propose a unified deep similarity network called the pattern mover similarity network, which is used both as the prior policy network when evaluating patterns and in the reward function when evaluating entities. We describe this model in detail in Section 3.

Specifically, to assign a prior probability to pattern p from search state s, we first use the pattern mover similarity network to calculate its similarity sim(p, E) to all entities E in state s; then, we calculate its prior probability pσ(s, p) as follows:

  pσ(s, p) = sim(p, E) / Σ_{p'} sim(p', E)    (5)

In addition, to reduce the pattern selection complexity (the number of patterns can be tens of thousands), we first use the RlogF function to filter out less related patterns and only select the top k patterns to calculate their selection probabilities. In our experiments, we set k to 200.

Reward Function in MCTS

The reward function is critical for efficiently estimating the real delayed feedback of each pattern. In this paper, we design the reward function as follows:

  R = sigmoid( (1 / (a · |E'|)) Σ_{e∈E'} sim(e, E) )    (6)

where E is the set of known entities (the seed entities plus the entities extracted in previous iterations) in the root state, E' is the set of all new entities extracted by the end of the simulation, sim(e, E) is the similarity score of a newly extracted entity e to the known entities, and a is a constant. We exploit the pattern mover similarity network to calculate the similarity score.

3 Pattern Mover Similarity Network

In this section, we describe the pattern mover similarity network (PMSN), which is a unified model for adaptively scoring the similarity of entities or patterns to seed entities. Specifically, the pattern mover similarity network contains two components: 1) the Adaptive Pattern Embedding (APE), which can adaptively represent patterns, entities, and entity sets in a unified way; 2) the pattern mover similarity (PMS) measurement, which calculates the similarity of two APEs.

The PMSN model is mainly used in three ways: 1) the PMSN is used as the prior policy network in the MCTS algorithm to evaluate the similarity of patterns; 2) the PMSN is used to evaluate the newly extracted entities within an MCTS simulation, whose evaluation scores are used to calculate rewards; 3) the PMSN is also used as the entity scoring function in the Entity Scoring stage of the bootstrapping process, as mentioned in Section 2.1.

3.1 Adaptive Pattern Embedding

In this subsection, we first describe how to embed patterns; then we introduce how to obtain the initial APE; finally, we introduce the probability adaption mechanism for better entity representation.

Pattern Embedding. As a basic step of our PMSN model, we use recent distributional representation techniques, e.g., word2vec (Mikolov et al., 2013), to embed context patterns. Specifically, we use the mean word embedding of a pattern's surface text as the pattern embedding.
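The mean-embedding pattern representation, the prior of Eq. (5), and the reward of Eq. (6) can be sketched as follows. This is a minimal illustration under stated assumptions: `word_vecs` stands in for a pre-trained embedding lookup and `sim_to_known` for the PMSN similarity to the known entities; both names are hypothetical.

```python
import math

def pattern_embedding(pattern, word_vecs):
    """Mean of the word vectors of the pattern's surface tokens (Section 3.1).
    Tokens missing from word_vecs (OOV terms) are skipped."""
    vecs = [word_vecs[w] for w in pattern.split() if w in word_vecs]
    return [sum(dim) / len(vecs) for dim in zip(*vecs)]

def prior_probabilities(patterns, sim_to_known):
    """Eq. (5): normalize PMSN similarities into prior probabilities."""
    scores = {p: sim_to_known(p) for p in patterns}
    z = sum(scores.values())
    return {p: s / z for p, s in scores.items()}

def simulation_reward(new_entities, sim_to_known, a=1.0):
    """Eq. (6): squashed mean similarity of the newly extracted entities
    to the known entities, with scaling constant a."""
    mean_sim = sum(sim_to_known(e) for e in new_entities) / (a * len(new_entities))
    return 1.0 / (1.0 + math.exp(-mean_sim))
```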


The word embeddings are from GloVe (Pennington et al., 2014), and we filter out any pattern containing more than one OOV term, i.e., a term whose embedding is not contained in GloVe.

Initial APE. The APE embeds an entity by a distribution over its context pattern embeddings: the pattern embeddings represent the different semantics the entity may belong to or be related to, while the distributional probabilities over those pattern embeddings reflect the different representativeness of different patterns. Intuitively, if a context pattern of an entity has a higher distributional probability, the pattern is more likely to represent the main semantic concept of this entity, and the entities matched by this pattern are more likely to belong to the same category as this entity.

Therefore, if we can find the n most representative context patterns and calculate their distributional probabilities based on their representativeness, we can precisely represent the main semantics of entities. In this paper, we consider a pattern representative for an entity if: the pattern co-occurs frequently with the entity; and the pattern matches as few other entities as possible. Based on this intuition, we calculate the representativeness score for each context pattern p of an entity e as follows:

  w(p, e) = ( N(e, p) × log N(e, p) ) / C(p)    (7)

where C(p) is the number of different entities matched by p, and N(e, p) is the frequency of p co-occurring with entity e. Based on the calculated representativeness scores, the distributional probabilities of the APE are obtained by normalizing all representativeness scores to sum to 1.

In addition, the above approach can also be used to calculate the APE for patterns and entity sets. 1) For a pattern, we use the pattern itself as its only context pattern, with a distributional probability of 1.0. 2) For an entity set, e.g., a set of seed entities, we take the whole entity set as the input. The context patterns are those patterns that appear in the context of at least one entity in the entity set, and their distributional probabilities are calculated as follows:

  w(p, E) = ( N(E, p) × log N(E, p) ) / C(p)    (8)

where E is the entity set and N(E, p) is the frequency of p co-occurring with entities in E.

Finally, we denote the APE as <X, w>, where X is the context pattern embedding matrix and w is the vector of distributional probabilities. Due to computational limitations, we only select the top n representative patterns. Therefore, X is actually an n × d matrix, where d is the dimension of each pattern embedding, and w is an n-dimensional vector.

Probability Adaption Using the MCTS Algorithm. Although the above approach provides unified representations for both patterns and entities, it can still fail to represent the underlying semantic concept of the seed entities. For example, most context patterns of the seed entity set {London, Paris, Beijing} are related to the city concept, e.g., "* is a big city", but the above approach can still assign higher probabilities to unrelated patterns that match many other entities.

To address this problem, we design a probability adaption mechanism, which uses the MCTS algorithm to fine-tune the distributional probabilities. Specifically, we fine-tune the probabilities of patterns using the returned action values after finishing the MCTS simulations. As mentioned in Section 2.2, the action values of patterns are used as delayed feedback, and the patterns with higher delayed feedback are more likely to be representative, since they extract more positive entities and fewer negative ones. Therefore, we fine-tune the distributional probabilities as follows:

  w_t(p, e) ∝ w_{t-1}(p, e) · Q(s0, p)    (9)

where w_{t-1}(p, e) is the probability of p at iteration t-1, and Q(s0, p) is the returned action value of p in iteration t.

3.2 Pattern Mover Similarity

To compute the similarity of two APEs, we design a similarity measurement that considers both the similarity of the pattern embeddings and the similarity of their probabilities.

Given two APEs, i.e., two pattern embedding matrices with distribution vectors w and w', we calculate their similarity using the following formulation inspired by Kusner et al. (2015):

  max_{T≥0} Σ_{i,j=1}^{n} T_ij · sim(i, j)    (10)

subject to:

  Σ_{j=1}^{n} T_ij = w_i,  ∀i ∈ 1, ..., n    (11)

  Σ_{i=1}^{n} T_ij = w'_j,  ∀j ∈ 1, ..., n    (12)
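Equations (10)-(12) define a small transport problem between the two distributions. The sketch below illustrates it with a greedy flow that ships mass along the most similar pattern pairs first; this is an illustrative approximation under the assumption that `sim` is the pairwise similarity matrix between the two APEs' pattern embeddings, not the paper's exact solver (the exact PMS is the optimum of the linear program).

```python
def pattern_mover_similarity(sim, w1, w2):
    """Greedy sketch of the PMS transport problem (Eqs. 10-12).

    sim[i][j]: similarity between pattern i of one APE and pattern j of the
    other; w1, w2: the two distributional probability vectors (each sums to 1).
    """
    r = list(w1)                       # mass still to ship from each row
    c = list(w2)                       # remaining capacity of each column
    pairs = sorted(((sim[i][j], i, j)
                    for i in range(len(w1))
                    for j in range(len(w2))), reverse=True)
    total = 0.0
    for s, i, j in pairs:              # most similar pairs absorb flow first
        flow = min(r[i], c[j])
        if flow > 0:
            total += flow * s
            r[i] -= flow
            c[j] -= flow
    return total
```

With equal distributional probabilities and a clearly matched pattern set, the greedy flow coincides with the optimal transport plan, which is the setting of the Figure 3 example.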


where sim(i, j) is the cosine similarity between the i-th pattern embedding of one entity and the j-th pattern embedding of the other entity. We denote the above measurement as the Pattern Mover Similarity (PMS) measurement.

[Figure 3: The components of the PMS metric between the APEs of the seed entity set {London, Paris, Beijing} and the entity Moscow (with equal distributional probabilities for simplicity). The arrows represent the flow between two patterns and their similarity, e.g., "the US Embassy in *" to "the British Embassy in *" (0.77) and "* is the capital of" to "* is the capital of" (1.00), giving Sim({London, Paris, Beijing}, Moscow) = 0.674.]

Using this formulation, PMS assigns high similarity scores to an APE pair if: there are many common or similar patterns between the two APEs; and they have similarly high distributional probabilities on those representative patterns. Figure 3 shows an example of how PMS works.

Using the above measurement, our method can uniformly calculate the similarity of any two APEs of entities, patterns, or entity sets.

4 Experiments

In this section, we describe the experimental settings and the results of our bootstrapping method on entity set expansion.

4.1 Experimental Settings

Corpus. We conduct experiments on three public datasets: the Google Web 1T corpus (Brants and Franz, 2006), and APR and Wiki (Shen et al., 2017). 1) The Google Web 1T corpus contains a large number of n-grams compiled from a corpus of 1 trillion words. Following Shi et al. (2014), we use 5-grams as the entity context and filter out those 5-grams consisting entirely of stopwords or common words. We use the 13 categories of entities (see Table 1) listed in Shi et al. (2014) and compare our method with traditional bootstrapping ESE methods on this corpus. 2) APR (2015 news from AP and Reuters) and Wiki (a subset of English Wikipedia) are two datasets published by Shen et al. (2017). Each of them contains about 1 million sentences. We use all 12 categories of entities listed in Shen et al. (2017) and compare the final entity scoring performance with it on both datasets.

Table 1: Target categories used in our experiments.

  Category  Description             | Category  Description
  CAP       Capital name            | FAC       Man-made structures
  ELE       Chemical element        | ORG       Organizations
  FEM       Female first name       | GPE       Geo-political entities
  MALE      Male first name         | LOC       Non-GPE locations
  LAST      Last name               | DAT       A date or period
  TTL       Honorific title         | LANG      Any named language
  NORP      Nationality, Religion, Political |

Baselines. To evaluate the efficiency of the MCTS and the PMSN, we use several baselines:
1) POS: a bootstrapping method which only uses positive seeds and no other constraints;
2) MEB (Curran et al., 2007): a mutual exclusion bootstrapping method, which uses category exclusion as the bootstrapping constraint;
3) COB (Shi et al., 2014): a probabilistic bootstrapping method which uses both positive and negative seeds;
4) SetExpan (Shen et al., 2017): a corpus-based entity set expansion method, which adaptively selects context features and ensembles them in an unsupervised way to score entities.
Specifically, we compare baselines (1)-(3) and our method on Google Web 1T; we compare baseline (4) and our method on APR and Wiki.

Metrics. To compare our method with traditional bootstrapping methods on the Google Web 1T dataset, we use P@n (precision at top n) and the mean average precision (MAP), the same as Shi et al. (2014). For the APR and Wiki datasets, we use MAP@n (n=10, 20, 50) to evaluate the entity scoring performance of our method. In our experiments, we select seed entities manually; the correctness of all extracted entities is manually judged, with some external supporting resources, e.g., entity lists collected from Wikipedia.

4.2 Experimental Results

Comparison with three baseline methods on the Google Web 1T corpus. Table 2 shows the performance of different bootstrapping methods on the Google Web 1T dataset.

Table 2: Overall results for entity set expansion on the Google Web 1T dataset, where Ours_full is the full version of our method, Ours-MCTS is our method with the MCTS disabled, and Ours-PMSN is our method with the PMSN replaced by fixed word embeddings. * indicates that COB uses human feedback for seed entity selection.

  Method     P@10  P@20  P@50  P@100  P@200  MAP
  POS        0.84  0.74  0.55  0.41   0.34   0.42
  MEB        0.83  0.79  0.68  0.58   0.51   -
  COB*       0.97  0.96  0.90  0.79   0.66   0.85
  Ours_full  0.97  0.96  0.92  0.82   0.69   0.87
  Ours-MCTS  0.85  0.81  0.73  0.63   0.52   0.75
  Ours-PMSN  0.63  0.60  0.56  0.48   0.42   0.61

We can see that our full model outperforms the three baseline methods: specifically, compared with the baseline POS, our method achieves a 41% improvement on P@100, a 35% improvement on P@200 and a 45% improvement on MAP; compared with the baseline MEB, our method achieves a 24% improvement on P@100 and an 18% improvement on P@200; compared with the baseline COB, our method achieves a 3% improvement on both P@100 and P@200, and a 2% improvement on MAP. This means that our method can extract more true entities and assign higher ranking scores to these entities.

Comparison with SetExpan on APR and Wiki. To further verify that our method can learn a better representation and adaptively score entities in ESE, we compare our method with the novel entity set expansion method SetExpan, which is a non-bootstrapping method (see Table 3). From Table 3, we can see that our method outperforms SetExpan on both the APR and Wiki datasets: our method achieves a 0.8% improvement on MAP@50 on the APR dataset and a 4.1% improvement on MAP@50 on the Wiki dataset. This further verifies that our method can improve the performance of bootstrapping for Entity Set Expansion.

Table 3: The adaptive entity scoring performance of different methods on the APR and Wiki datasets.

  APR:
  Method    MAP@10  MAP@20  MAP@50
  SetExpan  0.897   0.862   0.789
  Ours      0.913   0.853   0.797

  Wiki:
  Method    MAP@10  MAP@20  MAP@50
  SetExpan  0.957   0.901   0.745
  Ours      0.963   0.925   0.786

4.3 Detailed Analysis

Comparison with the Ours-MCTS and Ours-PMSN methods. From Table 2, we can also see that if we replace the Monte Carlo Tree Search with top-n pattern selection, the performance decreases by 19% on P@100 and 17% on P@200; if we replace the PMSN with fixed word embeddings, the performance decreases by 34% on P@100 and 27% on P@200. These results show that both the PMSN and the MCTS algorithm are critical for our model's performance, and that they enhance each other: the PMSN can learn a better representation by combining with the MCTS algorithm, and the MCTS can effectively estimate delayed feedback using the PMSN.

Performance of our full method on different categories on the Google Web 1T dataset. Table 4 shows the performance of our method on different categories on the Google Web 1T dataset.

Table 4: The performance of our full method on different categories on the Google Web 1T dataset.

  Category  P@20  P@50  P@100  P@200
  CAP       1.00  1.00  0.94   0.69
  ELE       1.00  0.84  0.51   0.36
  FEM       1.00  1.00  1.00   0.95
  MALE      1.00  1.00  1.00   0.98
  LAST      1.00  1.00  1.00   1.00
  TTL       0.85  0.64  0.49   0.34
  NORP      0.95  0.96  0.89   0.60
  FAC       0.95  0.86  0.59   0.42
  ORG       1.00  1.00  1.00   0.94
  GPE       1.00  1.00  1.00   0.94
  LOC       0.80  0.76  0.70   0.67
  DAT       1.00  1.00  0.82   0.56
  LANG      1.00  0.98  0.81   0.52

We can see that our method achieves high performance on most categories except for the ELE, TTL and FAC entities: the lower performance on ELE entities is mainly because the total number of chemical elements is fewer than 150, and most of them occur only rarely; the lower performance on TTL and FAC entities is mainly because the context patterns of TTL entities are similar to those of person names, and the context patterns of FAC entities are similar to those of location entities.

Top entities selected by different methods. To intuitively demonstrate the effectiveness of delayed feedback in our method, we illustrate the top

Iters | Patterns selected by Ours-full | Patterns selected by Ours-MCTS | Patterns selected by Ours-PMSN
1     | Embassy of Sweden in *         | held a meeting in *            | * and New York in
2     | Embassy of Belgium in *        | * meeting was held on          | in New York or *
3     | 's capital city of *           | * Meeting to be held           | between New York and *
4     | * is the capital city          | * held its first meeting       | * hotel reservations with discount
5     | * was the capital of           | * meeting to be held           | * is a great city

Table 5: The top patterns selected by different methods when expanding capital entities in the first 5 iterations.
1 pattern selected in each of the first five iterations by the three methods in Table 5. From Table 5, we can see that the top patterns selected by our full method are more related to the seed entities than those of the other two baselines. Besides, we can also see that without the MCTS algorithm or the PMSN, most top patterns are less related and easily cause semantic drift to other categories.

5 Related Work

Weakly supervised methods for information extraction (IE) are often provided only scant supervision, such as knowledge base facts as distant supervision (Mintz et al., 2009; Hoffmann et al., 2011; Zeng et al., 2015), a small number of seed samples in bootstrapping (Riloff and Jones, 1999; Carlson et al., 2010; Gupta and Manning, 2014), or label propagation (Chen et al., 2006). As a classical technique, bootstrapping usually exploits pattern features (Curran et al., 2007; Shi et al., 2014), document features (Stevenson and Greenwood, 2005; Liao and Grishman, 2010), or syntactic and semantic contextual features (He and Grishman, 2015; Batista et al., 2015) to extract and classify new instances.

Limited by the sparse supervision, many previous bootstrapping methods combine supervision signals with statistical features from previous iterations to evaluate and select patterns. Among them, the RlogF-like function (Agichtein and Gravano, 2000; Pantel and Pennacchiotti, 2006; Gupta and Manning, 2014) is the most popular, but it often suffers from the semantic drift problem. Other studies introduce extra constraints when selecting patterns, e.g., parallel mutual exclusion (Curran et al., 2007; McIntosh and Curran, 2008), coupling constraints (Carlson et al., 2010), and specific extraction pattern forms (Kozareva and Hovy, 2010). Besides, graph-based methods (Li et al., 2011; Tao et al., 2015) and probability-based methods (Shi et al., 2014) have also been used to improve bootstrapping performance. However, all these methods exploit only instant feedback rather than delayed feedback.

To evaluate entity scores, a simple approach is to directly label the entities extracted by the selected patterns as positive instances (Curran et al., 2007; Carlson et al., 2010); a more effective approach is to leverage pattern matching statistics (Riloff, 1996; Agichtein and Gravano, 2000; Thelen and Riloff, 2002; Pantel and Pennacchiotti, 2006; Kozareva and Hovy, 2010) or one-hot context pattern embeddings (Yangarber et al., 2000; Paşca, 2007; Pantel et al., 2009) within the bootstrapping process. However, these methods depend on the iteration process and often fail for sparse matching patterns. Recently, Gupta and Manning (2015) use the word embedding similarity of entities to help entity classification; Shen et al. (2017) use a non-bootstrapping method to adaptively select context features and ensemble them to evaluate entities; Berger et al. (2018) exploit human labels as external supervision to guide bootstrapping and learn a custom embedding to assist human annotation.

6 Conclusions

In this paper, we propose a deep similarity network-based model combined with the MCTS algorithm to bootstrap Entity Set Expansion. Specifically, we leverage the Monte Carlo Tree Search (MCTS) algorithm to efficiently estimate the delayed feedback of each pattern during bootstrapping; we propose a Pattern Mover Similarity Network (PMSN) to uniformly embed entities and patterns using a distribution over context pattern embeddings; and we combine the MCTS and the PMSN to adaptively learn a better embedding for evaluating both patterns and entities. Experimental results show that our method can select better patterns based on efficiently estimated delayed feedback, and learn a better entity scoring function using the PMSN combined with the MCTS algorithm. For future work, because bootstrapping is a common method for many information extraction tasks, we will design new models and apply them to related tasks such as relation extraction and knowledge fusion.
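To make the instant-feedback limitation discussed in Section 5 concrete, the following sketch applies the classic RlogF pattern score (Riloff, 1996) to toy data mirroring the two patterns of Figure 1. This is an illustrative example, not the paper's implementation; the function name and the toy seed set are our own.

```python
import math

def rlogf(extracted, seeds):
    """RlogF (Riloff, 1996): (F / N) * log2(F), where F is the number of
    extractions that are known category members and N is the total number
    of extractions. This is a purely instant-feedback score."""
    n = len(extracted)
    f = sum(1 for e in extracted if e in seeds)
    if f == 0 or n == 0:
        return 0.0
    return (f / n) * math.log2(f)

# Toy data mirroring Figure 1: both patterns extract 4 true capitals out of 5,
# so any instant-feedback score based only on these counts cannot separate them.
seeds = {"London", "Beijing", "Paris", "Rome", "Tokyo", "Berlin",
         "Moscow", "Baghdad", "Nairobi", "Seoul"}
p_city = ["Rome", "Tokyo", "Berlin", "Sydney", "Moscow"]         # "* is a big city"
p_embassy = ["Baghdad", "Moscow", "Nairobi", "Seoul", "France"]  # "the US Embassy in *"

# Identical instant feedback: (4/5) * log2(4) = 1.6 for both patterns,
# even though Figure 1 shows they lead to very different later iterations.
assert rlogf(p_city, seeds) == rlogf(p_embassy, seeds)
```

Distinguishing such patterns requires looking ahead at the quality of the patterns and entities they induce in later iterations, which is exactly the delayed feedback that the MCTS component estimates.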

References

Eugene Agichtein and Luis Gravano. 2000. Snowball: Extracting Relations from Large Plain-text Collections. In Proceedings of the Fifth ACM Conference on Digital Libraries, pages 85–94, NY, USA.

David S. Batista, Bruno Martins, and Mário J. Silva. 2015. Semi-Supervised Bootstrapping of Relationship Extractors with Distributional Semantics. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 499–504, Lisbon, Portugal. Association for Computational Linguistics.

Matthew Berger, Ajay Nagesh, Joshua Levine, Mihai Surdeanu, and Helen Zhang. 2018. Visual Supervision in Bootstrapped Information Extraction. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2043–2053, Brussels, Belgium. Association for Computational Linguistics.

Andrew Carlson, Justin Betteridge, Richard C. Wang, Estevam R. Hruschka, Jr., and Tom M. Mitchell. 2010. Coupled Semi-supervised Learning for Information Extraction. In Proceedings of the Third ACM International Conference on Web Search and Data Mining, pages 101–110, NY, USA.

Jinxiu Chen, Donghong Ji, Chew Lim Tan, and Zhengyu Niu. 2006. Relation Extraction Using Label Propagation Based Semi-supervised Learning. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics, pages 129–136, Stroudsburg, PA, USA. Association for Computational Linguistics.

James R. Curran, Tara Murphy, and Bernhard Scholz. 2007. Minimising semantic drift with mutual exclusion bootstrapping. In Proceedings of the 10th Conference of the Pacific Association for Computational Linguistics, volume 6, pages 172–180. Citeseer.

Sonal Gupta and Christopher Manning. 2014. Improved Pattern Learning for Bootstrapped Entity Extraction. In Proceedings of the Eighteenth Conference on Computational Natural Language Learning, pages 98–108, Ann Arbor, Michigan. Association for Computational Linguistics.

Sonal Gupta and Christopher D. Manning. 2015. Distributed Representations of Words to Guide Bootstrapped Entity Classifiers. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1215–1220, Denver, Colorado. Association for Computational Linguistics.

Yifan He and Ralph Grishman. 2015. ICE: Rapid Information Extraction Customization for NLP Novices. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations, pages 31–35, Denver, Colorado. Association for Computational Linguistics.

Raphael Hoffmann, Congle Zhang, Xiao Ling, Luke Zettlemoyer, and Daniel S. Weld. 2011. Knowledge-based Weak Supervision for Information Extraction of Overlapping Relations. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 541–550, Stroudsburg, PA, USA. Association for Computational Linguistics.

Zornitsa Kozareva and Eduard Hovy. 2010. Learning Arguments and Supertypes of Semantic Relations Using Recursive Patterns. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 1482–1491, Stroudsburg, PA, USA. Association for Computational Linguistics.

Matt Kusner, Yu Sun, Nicholas Kolkin, and Kilian Weinberger. 2015. From word embeddings to document distances. In Proceedings of the 32nd International Conference on Machine Learning, volume 37, pages 957–966, Lille, France.

Haibo Li, Danushka Bollegala, Yutaka Matsuo, and Mitsuru Ishizuka. 2011. Using Graph Based Method to Improve Bootstrapping Relation Extraction. In Computational Linguistics and Intelligent Text Processing, volume 6609, pages 127–138. Springer Berlin Heidelberg, Berlin, Heidelberg.

Shasha Liao and Ralph Grishman. 2010. Filtered Ranking for Bootstrapping in Event Extraction. In Proceedings of the 23rd International Conference on Computational Linguistics, pages 680–688, Stroudsburg, PA, USA. Association for Computational Linguistics.

Tara McIntosh. 2010. Unsupervised Discovery of Negative Categories in Lexicon Bootstrapping. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 356–365, Stroudsburg, PA, USA. Association for Computational Linguistics.

Tara McIntosh and James R. Curran. 2008. Weighted Mutual Exclusion Bootstrapping for Domain Independent Lexicon and Template Acquisition. In Proceedings of the Australasian Language Technology Association Workshop 2008, pages 97–105, Hobart, Australia.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Mike Mintz, Steven Bills, Rion Snow, and Dan Jurafsky. 2009. Distant Supervision for Relation Extraction Without Labeled Data. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pages 1003–1011, Stroudsburg, PA, USA. Association for Computational Linguistics.

Dana Movshovitz-Attias and William W. Cohen. 2012. Bootstrapping Biomedical Ontologies for Scientific Text Using NELL. In Proceedings of the 2012 Workshop on Biomedical Natural Language Processing, pages 11–19.

Patrick Pantel, Eric Crestan, Arkady Borkovsky, Ana-Maria Popescu, and Vishnu Vyas. 2009. Web-scale Distributional Similarity and Entity Set Expansion. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pages 938–947, Stroudsburg, PA, USA. Association for Computational Linguistics.

Patrick Pantel and Marco Pennacchiotti. 2006. Espresso: Leveraging Generic Patterns for Automatically Harvesting Semantic Relations. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics, pages 113–120, Stroudsburg, PA, USA. Association for Computational Linguistics.

Marius Paşca. 2007. Weakly-supervised Discovery of Named Entities Using Web Search Queries. In Proceedings of the Sixteenth ACM Conference on Information and Knowledge Management, pages 683–690, New York, NY, USA. ACM.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global Vectors for Word Representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pages 1532–1543, Doha, Qatar. Association for Computational Linguistics.

Ashequl Qadir, Pablo N. Mendes, Daniel Gruhl, and Neal Lewis. 2015. Semantic Lexicon Induction from Twitter with Pattern Relatedness and Flexible Term Length. In Twenty-Ninth AAAI Conference on Artificial Intelligence, pages 2432–2439.

Ellen Riloff. 1996. An empirical study of automated dictionary construction for information extraction in three domains. Artificial Intelligence, 85:101–134. Elsevier.

Ellen Riloff and Rosie Jones. 1999. Learning dictionaries for information extraction by multi-level bootstrapping. In AAAI/IAAI, pages 474–479.

Jiaming Shen, Zeqiu Wu, Dongming Lei, Jingbo Shang, Xiang Ren, and Jiawei Han. 2017. SetExpan: Corpus-Based Set Expansion via Context Feature Selection and Rank Ensemble. In ECML PKDD, volume 10534, pages 288–304, Cham. Springer International Publishing.

Bei Shi, Zhenzhong Zhang, Le Sun, and Xianpei Han. 2014. A probabilistic co-bootstrapping method for entity set expansion. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics, pages 2280–2290.

Mark Stevenson and Mark A. Greenwood. 2005. A Semantic Approach to IE Pattern Induction. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, pages 379–386, Stroudsburg, PA, USA. Association for Computational Linguistics.

Fangbo Tao, Bo Zhao, Ariel Fuxman, Yang Li, and Jiawei Han. 2015. Leveraging Pattern Semantics for Extracting Entities in Enterprises. In Proceedings of the 24th International Conference on World Wide Web, pages 1078–1088, Republic and Canton of Geneva, Switzerland. International World Wide Web Conferences Steering Committee.

Michael Thelen and Ellen Riloff. 2002. A Bootstrapping Method for Learning Semantic Lexicons Using Extraction Pattern Contexts. In Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing, pages 214–221, Stroudsburg, PA, USA. Association for Computational Linguistics.

Roman Yangarber. 2003. Counter-training in Discovery of Semantic Patterns. In Proceedings of the 41st Annual Meeting on Association for Computational Linguistics, pages 343–350, Stroudsburg, PA, USA. Association for Computational Linguistics.

Roman Yangarber, Ralph Grishman, Pasi Tapanainen, and Silja Huttunen. 2000. Automatic Acquisition of Domain Knowledge for Information Extraction. In Proceedings of the 18th Conference on Computational Linguistics - Volume 2, pages 940–946, Stroudsburg, PA, USA. Association for Computational Linguistics.

Roman Yangarber, Winston Lin, and Ralph Grishman. 2002. Unsupervised Learning of Generalized Names. In Proceedings of the 19th International Conference on Computational Linguistics - Volume 1, pages 1–7, Stroudsburg, PA, USA. Association for Computational Linguistics.

Daojian Zeng, Kang Liu, Yubo Chen, and Jun Zhao. 2015. Distant Supervision for Relation Extraction via Piecewise Convolutional Neural Networks. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1753–1762, Lisbon, Portugal. Association for Computational Linguistics.
