
Constraint-Based Entity Matching

Warren Shen, Xin Li, AnHai Doan


Database & AI Groups
University of Illinois, Urbana
Entity Matching
• Decide if mentions refer to the same real-world entity

Chris Li, Jane Smith. “Numerical Analysis”. SIAM 2001


Chen Li, Doug Chan. “Ensemble Learning”
C. Li, D. Chan. “Ensemble Learning”. ICML 2003

• Key problem in numerous applications
– Information integration
– Natural language understanding
– Semantic Web

2
State of the Art
• Numerous solutions in the AI, Database, and Web communities
– Cohen, Ravikumar, & Fienberg 2003
– Li, Morie, & Roth 2004
– Bhattacharya & Getoor 2004
– McCallum, Nigam, & Ungar 2000
– Pasula et al. 2003
– Wellner et al. 2004

• Most solutions largely exploit only syntactic similarity
– “Jeff Smith” ≈ “J. Smith”
– “(217) 235-1234” ≈ “235-1234”

3
Semantic Constraints
Incompatible
C. Li. “User Interfaces”. SIGCHI 2000
C. Li, J. Smith. “Numerical Analysis”. SIAM 2001

Subsumption
Chris Li’s Homepage: “Numerical Analysis”, SIAM 2001 with J. Smith.
DBLP: Chris Li, Jane Smith. “Numerical Analysis”. SIAM 2001

Layout
Chen Li’s Homepage:
Chen Li, Doug Chan. “Ensemble Learning”. ICML 2003
C. Li. “Data Mining”. KDD 2000

4
Numerous Semantic Constraint Types

Type           Example
Aggregate      No researcher has chaired more than 3 conferences in a year
Subsumption    If a citation X from DBLP matches a citation Y in a homepage, then each author in Y matches some author in X
Neighborhood   If authors X and Y share similar names and some coauthors, they are likely to match
Incompatible   No researcher exists who has published in both HCI and numerical analysis
Layout         If two mentions in the same document share similar names, they are likely to match
Uniqueness     Mentions in the PC listing of a conference refer to different researchers
Ordering       If two citations match, then their authors will be matched in order
Individual     The researcher named “Mayssam Saria” has fewer than five mentions in DBLP (e.g. being a new graduate student with fewer than five papers)
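
A constraint of this kind can be read as a predicate over candidate matches. The following is a minimal sketch of the subsumption check only, assuming a hypothetical `names_compatible` test (same last name, same first initial) in place of whatever name matcher the system actually uses.

```python
def names_compatible(a, b):
    """Hypothetical syntactic test: same last name and same first initial."""
    a_parts = a.replace(".", "").split()
    b_parts = b.replace(".", "").split()
    return a_parts[-1] == b_parts[-1] and a_parts[0][0] == b_parts[0][0]


def subsumption_holds(homepage_authors, dblp_authors):
    """True if every author of homepage citation Y matches some author of DBLP citation X."""
    return all(
        any(names_compatible(y, x) for x in dblp_authors)
        for y in homepage_authors
    )


dblp = ["Chris Li", "Jane Smith"]
print(subsumption_holds(["J. Smith"], dblp))  # True
print(subsumption_holds(["D. Chan"], dblp))   # False
```

If the check fails, the candidate match between X and Y would be penalized or rejected, depending on how strongly the constraint is enforced.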

5
Our Contributions
• Develop a solution to exploit semantic constraints
– Models constraints in a uniform probabilistic manner
– Clusters mentions using a generative model
– Uses relaxation labeling to handle constraints
– Adds a pairwise layer to further improve accuracy

• Experimental results on two real-world domains
– Researchers, IMDB
– Improved accuracy over state of the art by 3-12% F-1

6
Probabilistic Modeling of Constraints
• Modeled as the effect on the probability that a mention refers to a real-world entity

“If two mentions in the same document share similar names, they are likely to match”:

m1: Chen Li → e1
m2: C. Li
P(m2 = e1 | m1 = e1) = 0.8

• Constraint probabilities have a natural interpretation

• Can be learned or manually specified by a domain expert

7
The Entity Matching Problem
Documents: d1 = {m1: Chen Li, m2: C. Li}, d2 = {m3: Chris Lee}

Constraints: c1 = layout constraint, p(c1) = 0.8

Matching Pairs: m1 = m2

Solution
• Model document generation
• Cluster mentions using this model
8
Modeling Document Generation
• Generate mentions for each document
– Select entities
– Generate and “sprinkle” mentions
• Check constraints for each mention
– Decide whether to enforce constraint c
– If enforced, check if mention violates c
– If yes, discard documents and repeat process

[Figure: entity set E and parameters θ; selected entities e1 (Chen Li), e2 (Chris Lee), e2 (Chris Lee); mentions m1: Chen Li and m2: C. Li in document d1, m3: Chris Lee in document d2; c1: layout constraint, p(c1) = 0.8]

(Extension of model in Li, Morie & Roth 2004)
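
The generate-check-discard loop can be read as rejection sampling. The following is a minimal sketch under stated assumptions: the entity table, the uniform choice of entities and name variants, and the last-name version of the layout constraint are all hypothetical stand-ins, not the model's actual distributions.

```python
import random

# Hypothetical toy entities with their name variants (not the slides' data).
ENTITIES = {
    "e1": ["Chen Li", "C. Li"],
    "e2": ["Chris Li", "C. Li"],
}


def last_name(mention):
    return mention.split()[-1]


def mention_violates(idx, doc):
    """Assumed layout constraint: same-document mentions that share a
    last name must come from the same entity."""
    name, entity = doc[idx]
    return any(
        last_name(name) == last_name(other_name) and entity != other_entity
        for j, (other_name, other_entity) in enumerate(doc)
        if j != idx
    )


def generate_document(n_mentions, p_enforce, rng):
    """Sample (mention, entity) pairs; restart when an enforced constraint fails."""
    while True:
        doc = []
        for _ in range(n_mentions):
            entity = rng.choice(sorted(ENTITIES))    # select an entity
            mention = rng.choice(ENTITIES[entity])   # "sprinkle" a name variant
            doc.append((mention, entity))
        # For each mention, decide whether to enforce c, then test for a violation.
        violated = any(
            rng.random() < p_enforce and mention_violates(i, doc)
            for i in range(len(doc))
        )
        if not violated:
            return doc


doc = generate_document(3, p_enforce=0.8, rng=random.Random(0))
print(doc)
```

With `p_enforce=1.0` every surviving document satisfies the constraint; with `p_enforce=0.0` the constraint has no effect, which is how p(c) acts as a soft weight.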
9
Clustering with the Generative Model
• Find mention assignments F and model parameters θ to maximize P(D, F | θ)

• Difficult to compute exactly, so use a variant of EM

\[ \theta_1 = \arg\max_{\theta} P(D, F_0 \mid \theta), \qquad F_1 = \arg\max_{F} P(D, F \mid \theta_1) \]

\[ \theta_2 = \arg\max_{\theta} P(D, F_1 \mid \theta), \qquad F_2 = \arg\max_{F} P(D, F \mid \theta_2), \quad \ldots \]

\[ (F^*, \theta^*) = \arg\max_{F, \theta} P(D, F \mid \theta) \]

10
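
The alternation between θ and F on the preceding slide can be sketched as a hard-assignment EM loop. This is an illustrative sketch only: the token-level naive Bayes model, the smoothing constant `alpha`, and all function names are assumptions, not the generative model of the talk.

```python
from collections import Counter


def estimate_theta(mentions, assignment, entities, alpha=0.1):
    """θ-step: smoothed estimates of P(e) and P(token | e) from the current F."""
    vocab = {tok for m in mentions for tok in m.split()}
    prior, token_probs = {}, {}
    for e in entities:
        members = [m for m, a in zip(mentions, assignment) if a == e]
        prior[e] = (len(members) + alpha) / (len(mentions) + alpha * len(entities))
        counts = Counter(tok for m in members for tok in m.split())
        total = sum(counts.values()) + alpha * len(vocab)
        token_probs[e] = {tok: (counts[tok] + alpha) / total for tok in vocab}
    return prior, token_probs


def assign_mentions(mentions, theta, entities):
    """F-step: give each mention the entity maximizing P(m | e) P(e)."""
    prior, token_probs = theta

    def score(m, e):
        p = prior[e]
        for tok in m.split():
            p *= token_probs[e][tok]
        return p

    return [max(entities, key=lambda e: score(m, e)) for m in mentions]


def hard_em(mentions, initial_assignment, entities, max_iters=20):
    """Alternate the θ-step and F-step until the assignment stops changing."""
    assignment, theta = list(initial_assignment), None
    for _ in range(max_iters):
        theta = estimate_theta(mentions, assignment, entities)
        new_assignment = assign_mentions(mentions, theta, entities)
        if new_assignment == assignment:
            break
        assignment = new_assignment
    return assignment, theta


mentions = ["Chen Li", "C. Li", "Chris Lee", "C. Lee"]
assignment, theta = hard_em(mentions, ["e1", "e2", "e2", "e2"], ["e1", "e2"])
print(assignment)
```

Like EM in general, this alternation only reaches a local optimum, so the result depends on the initial assignment F_0.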
Incorporating Constraints
• Extend the step that assigns mentions

\[ F_t = \arg\max_{F} P(D, F \mid \theta_t) \]

– Basic mention assignment: assign each mention m to

\[ \arg\max_{e} P(e \mid m), \qquad P(e \mid m) = \frac{P(m \mid e)\, P(e)}{P(m)} \]

– Extension: Use constraints to improve mention assignments

11
Enforcing Constraints on Clusters
• Apply constraints at each iteration

Compute parameters → Assign mentions → Apply constraints

• Use relaxation labeling to apply constraints to mention assignments

12
Relaxation Labeling
• Start with an initial labeling of mentions with entities
• Iteratively improve mention labels, given constraints

Chen Li = e1        Chris Lee = e2
C. Li = e2 → e1     Jane Smith = e4
Y. Lee = e3
C. Lee = e2
Smith, J = e4

Constraints: c1 = layout constraint, p(c1) = 0.8

• Can be extended to probabilistic constraints
• Scalable
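
The relabeling of C. Li from e2 to e1 can be illustrated with a small relaxation-labeling loop. This is a sketch under assumptions: it uses the classic multiplicative update p(e) ← p(e)(1 + support(e)) followed by normalization, a hypothetical `similar` name test, and only the layout constraint with weight p(c1) = 0.8. The update rule actually used in the system may differ.

```python
def similar(name_a, name_b):
    """Hypothetical name similarity: same last name and same first initial."""
    a, b = name_a.split(), name_b.split()
    return a[-1] == b[-1] and a[0][0] == b[0][0]


def relax(mentions, probs, p_c=0.8, iters=10):
    """mentions: list of (doc_id, name); probs: list of {entity: probability}.

    Each round, a mention's belief in entity e is boosted by the beliefs of
    similar-named mentions in the same document (the layout constraint).
    """
    for _ in range(iters):
        updated = []
        for i, (doc_i, name_i) in enumerate(mentions):
            support = {e: 0.0 for e in probs[i]}
            for j, (doc_j, name_j) in enumerate(mentions):
                if i != j and doc_i == doc_j and similar(name_i, name_j):
                    for e in support:
                        support[e] += p_c * probs[j].get(e, 0.0)
            raw = {e: probs[i][e] * (1.0 + support[e]) for e in probs[i]}
            z = sum(raw.values())
            updated.append({e: v / z for e, v in raw.items()})
        probs = updated  # synchronous update: all mentions use last round's beliefs
    labels = [max(p, key=p.get) for p in probs]
    return labels, probs


mentions = [("d1", "Chen Li"), ("d1", "C. Li")]
initial = [{"e1": 0.9, "e2": 0.1}, {"e1": 0.4, "e2": 0.6}]
labels, final = relax(mentions, initial)
print(labels)
```

Each update touches only a mention and its same-document neighbors, so the cost per round grows with the number of constrained mention pairs rather than with all pairs of mentions.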

14
Handling Probabilistic Constraints
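• Relaxation labeling can combine multiple probabilistic constraints

\[
\begin{aligned}
P(m = e) &= \sum_{O_m} P(m = e, O_m) \\
&= \sum_{O_m} P(m = e \mid O_m)\, P(O_m) \\
&= \sum_{O_m} P(m = e \mid f_1, \ldots, f_n)\, P(O_m) \\
&\propto \sum_{O_m} \sigma\!\left( \sum_{k=1}^{n} \alpha_k f_k(O_m, m, e) \right) \times \prod_{(m_i = e_i) \in O_m} P(m_i = e_i)
\end{aligned}
\]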
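
The last line of the derivation can be evaluated directly by enumerating the neighbor assignments O_m. The following is a minimal sketch, assuming σ is the logistic sigmoid and using a single made-up feature and weight. The mention m is fixed, so the feature functions here take only the context and the candidate entity.

```python
import itertools
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def combine_constraints(candidates, neighbor_probs, features, weights):
    """Return normalized P(m = e) for each candidate entity e.

    neighbor_probs: one {entity: P(m_i = e_i)} dict per neighboring mention.
    features: functions f_k(context, e), where context is a tuple holding
              one entity per neighbor (one assignment O_m).
    weights: the alpha_k, one per feature.
    """
    scores = {}
    for e in candidates:
        total = 0.0
        for context in itertools.product(*(p.keys() for p in neighbor_probs)):
            # P(O_m) factorizes over the neighbors' current label probabilities.
            p_context = math.prod(p[e_i] for p, e_i in zip(neighbor_probs, context))
            activation = sum(a * f(context, e) for a, f in zip(weights, features))
            total += sigmoid(activation) * p_context
        scores[e] = total
    z = sum(scores.values())
    return {e: s / z for e, s in scores.items()}


# Hypothetical layout feature: +1 if the neighbor has the same entity, else -1.
layout = lambda context, e: 1.0 if context[0] == e else -1.0
result = combine_constraints(["e1", "e2"], [{"e1": 0.9, "e2": 0.1}], [layout], [2.0])
print(result)
```

Exhaustive enumeration is exponential in the number of neighbors, so this form is only practical when each mention has a small constraint neighborhood.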

15
Pairwise Layer
• So far, we have applied constraints to clusters

Compute parameters → Assign mentions → Apply constraints

• It may be unclear how to enforce constraints on clusters

Cluster: {Li, Chen; Chen Li; C. Li; Li, C.}
Constraint: C. Li ≠ Li, C.
Remove C. Li or Li, C.?

• Add a pairwise layer
– Convert clusters into predicted matching pairs
– Remove only the pairs to which negative pairwise hard constraints apply
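
The pairwise layer can be sketched in a few lines. This is an assumed representation: mentions are identified by their strings and clusters are sets, with the function names chosen here for illustration.

```python
from itertools import combinations


def clusters_to_pairs(clusters):
    """Expand each cluster into all of its unordered mention pairs."""
    pairs = set()
    for cluster in clusters:
        for a, b in combinations(sorted(cluster), 2):
            pairs.add((a, b))
    return pairs


def apply_negative_constraints(pairs, must_not_match):
    """Drop only the pairs that a negative hard constraint forbids."""
    forbidden = {tuple(sorted(p)) for p in must_not_match}
    return {p for p in pairs if p not in forbidden}


clusters = [{"Li, Chen", "Chen Li", "C. Li", "Li, C."}]
pairs = clusters_to_pairs(clusters)
kept = apply_negative_constraints(pairs, [("C. Li", "Li, C.")])
print(len(pairs), len(kept))  # 6 5
```

Neither mention has to be evicted from the cluster: only the single forbidden pair disappears, and the other five predicted matches survive.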
16
Empirical Evaluation
• Two real-world domains
– Researchers, IMDB

• For each domain
– Collected documents
    – Researchers: homepages from DBLP and the web
    – IMDB: text and structured records from IMDB
– Marked up mentions and their attributes
    – 4,991 researcher mentions
    – 3,889 movie titles from IMDB
– Manually identified all correct matching pairs

• Evaluation Metric:
Precision = # true positives / # predicted pairs
Recall = # true positives / # correct pairs
F1 = (2 * P * R) / (P + R)
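
These pairwise metrics can be computed as follows. This is a minimal sketch, assuming pairs are stored as tuples in a canonical order. The zero-division guards are an added convention, not something the slide specifies.

```python
def pairwise_prf(predicted_pairs, correct_pairs):
    """Precision, recall, and F1 over sets of predicted and correct matching pairs."""
    predicted, correct = set(predicted_pairs), set(correct_pairs)
    true_positives = len(predicted & correct)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(correct) if correct else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


predicted = {("m1", "m2"), ("m1", "m3")}
correct = {("m1", "m2"), ("m2", "m4")}
print(pairwise_prf(predicted, correct))  # (0.5, 0.5, 0.5)
```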
17
Using Constraints Improves Accuracy
• Relaxation labeler improves F-1 by 3-12%

F1 (P / R)                    Researchers      Movies
Baseline                      .66 (.67/.65)    .69 (.61/.79)
Baseline + Relax              .78 (.78/.78)    .72 (.63/.83)
Baseline + Relax + Pairwise   .79 (.80/.79)    .73 (.64/.83)

• Relaxation labeling very fast

18
Using Constraints Individually
• Each constraint makes a contribution

Researchers       F1 (P / R)
Baseline          .66 (.67/.65)
+ Rare Value      .66 (.67/.66)
+ Subsumption     .67 (.68/.65)
+ Neighborhood    .70 (.68/.72)
+ Individual      .70 (.77/.64)
+ Layout          .71 (.68/.74)

Movies            F1 (P / R)
Baseline          .69 (.61/.79)
+ Incompatible    .70 (.62/.79)
+ Neighborhood    .70 (.62/.81)
+ Individual      .71 (.62/.82)

19
Related Work
• Much work in entity matching
– Cohen, Ravikumar, & Fienberg 2003
– Li, Morie, & Roth 2004
– Bhattacharya & Getoor 2004
– McCallum, Nigam, & Ungar 2000
– Pasula et al. 2003
– Wellner et al. 2004
• Recent work has looked at exploiting semantic constraints
– Personal Information Management (Dong et al. 2004)
– Profiler-based entity matching (Doan et al. 2003)
• Semantic constraints successfully exploited in other applications
– Clustering algorithms (Bilenko et al. 2004), ontology matching (Doan et al. 2002)

20
Summary and Future Work
• Exploit semantic constraints in entity matching
– Models constraints in a uniform probabilistic manner
– Uses a generative model and relaxation labeling to handle constraints in a scalable way
– Experimental results on two real-world domains show effectiveness

• Future work: Learning constraints effectively from current or external data

21
