Cluster Ppt5

Constraint-Based Entity Matching
Warren Shen, Xin Li, AnHai Doan

Database & AI Groups
University of Illinois, Urbana
Entity Matching
• Decide if mentions refer to the same real-world entity
Chris Li, Jane Smith. “Numerical Analysis”. SIAM 2001
Chen Li, Doug Chan. “Ensemble Learning”
C. Li, D. Chan. “Ensemble Learning”. ICML 2003
• Key problem in numerous applications
– Information integration
– Natural language understanding
– Semantic Web
2
State of the Art
• Numerous solutions in the AI, Database, and Web communities
– Cohen, Ravikumar, & Fienberg 2003
– Li, Morie, & Roth 2004
– Bhattacharya & Getoor 2004
– McCallum, Nigam, & Ungar 2000
– Pasula et al. 2003
– Wellner et al. 2004
• Most solutions largely exploit only syntactic similarity
– “Jeff Smith” ≈ “J. Smith”
– “(217) 235-1234” ≈ “235-1234”
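To make "syntactic similarity" concrete, here is a minimal sketch of a name test based only on string form. The function `names_compatible` and its rule (same last name, first names equal or one an initial of the other) are assumptions for illustration, not the method of any cited system.

```python
def names_compatible(a: str, b: str) -> bool:
    """Hypothetical syntactic test: same last name, and first names equal
    or one is an initial (e.g. "J.") of the other."""
    ta = a.replace(".", "").lower().split()
    tb = b.replace(".", "").lower().split()
    if not ta or not tb or ta[-1] != tb[-1]:
        return False
    if len(ta) == 1 or len(tb) == 1:  # only a last name was given
        return True
    fa, fb = ta[0], tb[0]
    return (fa == fb
            or (len(fa) == 1 and fb.startswith(fa))
            or (len(fb) == 1 and fa.startswith(fb)))

print(names_compatible("Jeff Smith", "J. Smith"))  # True
print(names_compatible("Chen Li", "C. Li"))        # True
print(names_compatible("Chris Li", "C. Li"))       # True
print(names_compatible("Chen Li", "Chris Li"))     # False
```

Note that “C. Li” is compatible with both “Chen Li” and “Chris Li”: string form alone cannot decide which researcher is meant.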
3
Semantic Constraints
Incompatible:
C. Li. “User Interfaces”. SIGCHI 2000
C. Li, J. Smith. “Numerical Analysis”. SIAM 2001
Subsumption:
Chris Li’s Homepage: “Numerical Analysis”, SIAM 2001 with J. Smith.
DBLP: Chris Li, Jane Smith. “Numerical Analysis”. SIAM 2001
Layout:
Chen Li’s Homepage:
Chen Li, Doug Chan. “Ensemble Learning”. ICML 2003
C. Li. “Data Mining”. KDD 2000
4
Numerous Semantic Constraint Types
Type | Example
Aggregate | No researcher has chaired more than 3 conferences in a year
Subsumption | If a citation X from DBLP matches a citation Y in a homepage, then each author in Y matches some author in X
Neighborhood | If authors X and Y share similar names and some coauthors, they are likely to match
Incompatible | No researcher exists who has published in both HCI and numerical analysis
Layout | If two mentions in the same document share similar names, they are likely to match
Uniqueness | Mentions in the PC listing of a conference refer to different researchers
Ordering | If two citations match, then their authors will be matched in order
Individual | The researcher named “Mayssam Saria” has fewer than five mentions in DBLP (e.g. being a new graduate student with fewer than five papers)
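As a rough sketch, two of the constraint types above (Uniqueness and Aggregate) can be written as checks over a candidate assignment of mentions to entities. The mention fields (`pc_listing`, `chair_of`, `year`) and function names are hypothetical and chosen only for this example.

```python
from collections import defaultdict

def violates_uniqueness(mentions, assignment):
    """Uniqueness: mentions in one PC listing must map to distinct entities."""
    seen = set()
    for m in mentions:
        listing = m.get("pc_listing")
        if listing is None:
            continue
        key = (listing, assignment[m["id"]])
        if key in seen:
            return True
        seen.add(key)
    return False

def violates_aggregate(mentions, assignment, max_chaired=3):
    """Aggregate: no entity chairs more than `max_chaired` conferences in a year."""
    chaired = defaultdict(set)
    for m in mentions:
        if "chair_of" in m:
            chaired[(assignment[m["id"]], m["year"])].add(m["chair_of"])
    return any(len(confs) > max_chaired for confs in chaired.values())

# Made-up mentions: two names taken from the same PC listing.
pc_mentions = [
    {"id": "m1", "name": "Chen Li", "pc_listing": "ICML 2003"},
    {"id": "m2", "name": "C. Li", "pc_listing": "ICML 2003"},
]
print(violates_uniqueness(pc_mentions, {"m1": "e1", "m2": "e1"}))  # True
print(violates_uniqueness(pc_mentions, {"m1": "e1", "m2": "e2"}))  # False
```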
5
Our Contributions
• Develop a solution to exploit semantic constraints
– Models constraints in a uniform probabilistic manner
– Clusters mentions using a generative model
– Uses relaxation labeling to handle constraints
– Adds a pairwise layer to further improve accuracy
• Experimental results on two real-world domains
– Researchers, IMDB
– Improved accuracy over state of the art by 3-12% F-1
6
Probabilistic Modeling of Constraints
• Modeled as the effect on the probability that a mention refers to a real-world entity
“If two mentions in the same document share similar names, they are likely to match”:
m1: Chen Li → e1
m2: C. Li
P(m2 = e1 | m1 = e1) = 0.8
• Constraint probabilities have a natural interpretation
• Can be learned or manually specified by a domain expert
7
The Entity Matching Problem
Documents: d1 (m1: Chen Li, m2: C. Li), d2 (m3: Chris Lee)
Constraints: c1 = layout constraint, p(c1) = 0.8
Matching Pairs: m1 = m2
Solution
• Model document generation
• Cluster mentions using this model
8
Modeling Document Generation
• Generate mentions for each document
– Select entities
– Generate and “sprinkle” mentions
• Check constraints for each mention
– Decide whether to enforce constraint c
– If enforced, check if mention violates c
– If yes, discard documents and repeat process
(Extension of model in Li, Morie & Roth 2004)
[Figure labels: E, θ; entities e1 (Chen Li), e2 (Chris Lee), e2 (Chris Lee); mentions m1: Chen Li, m2: C. Li, m3: Chris Lee; documents d1, d2; c1: layout constraint, p(c1) = 0.8]
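The generate-check-discard loop described above can be sketched as rejection sampling. This is a toy illustration under stated assumptions, not the authors' model: the entity table `ENTITIES`, the `similar` test, and uniform sampling are all made up here; only p(c1) = 0.8 comes from the slide.

```python
import random

# Hypothetical entities and the name variants they can be mentioned by.
ENTITIES = {"e1": ["Chen Li", "C. Li"], "e2": ["Chris Li", "C. Li"]}
P_C1 = 0.8  # p(c1) for the layout constraint, as on the slide

def similar(a, b):
    """Crude name similarity (an assumption): same last token."""
    return a.split()[-1] == b.split()[-1]

def generate_document(rng, n_mentions=2):
    """Select an entity per mention and 'sprinkle' one of its name variants."""
    doc = []
    for _ in range(n_mentions):
        entity = rng.choice(sorted(ENTITIES))
        doc.append((rng.choice(ENTITIES[entity]), entity))
    return doc

def mention_violates_layout(doc, j):
    """Mention j violates c1 if a similar name in the same document
    refers to a different entity."""
    name_j, ent_j = doc[j]
    return any(i != j and similar(name_j, name) and ent != ent_j
               for i, (name, ent) in enumerate(doc))

def generate_corpus(rng, n_docs=2):
    """Regenerate all documents until no enforced constraint is violated."""
    while True:
        docs = [generate_document(rng) for _ in range(n_docs)]
        violated = any(
            # First decide whether to enforce c1, then check the mention.
            rng.random() < P_C1 and mention_violates_layout(doc, j)
            for doc in docs for j in range(len(doc))
        )
        if not violated:
            return docs
        # Otherwise discard the documents and repeat the process.

for doc in generate_corpus(random.Random(0)):
    print(doc)
```

Because enforcement is itself random, a document whose similar names refer to different entities can still survive, but only rarely; that is what makes the constraint soft.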
9
Clustering with the Generative Model
• Find mention assignments F and model parameters θ to maximize P(D, F | θ)
• Difficult to compute exactly, so use a variant of EM
$$\theta_1 = \arg\max_{\theta} P(D, F_0 \mid \theta), \qquad F_1 = \arg\max_{F} P(D, F \mid \theta_1)$$
$$\theta_2 = \arg\max_{\theta} P(D, F_1 \mid \theta), \qquad F_2 = \arg\max_{F} P(D, F \mid \theta_2), \quad \ldots$$
$$(F^*, \theta^*) = \arg\max_{F, \theta} P(D, F \mid \theta)$$
10
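The alternation between estimating θ and reassigning mentions on the preceding slide can be sketched as a hard-EM loop. The model below (a smoothed naive-Bayes model over name tokens) is an assumption made to keep the example self-contained; it is not the generative model of the talk, and the smoothing makes the parameter step an approximation of the exact arg max.

```python
import math
from collections import Counter

def estimate(mentions, F, entities, vocab, alpha=0.1):
    """Parameter step: smoothed estimate of P(e) and P(token | e) given F."""
    prior, tok = {}, {}
    for e in entities:
        members = [m for m, f in zip(mentions, F) if f == e]
        prior[e] = (len(members) + alpha) / (len(mentions) + alpha * len(entities))
        counts = Counter(t for m in members for t in m)
        total = sum(counts.values())
        tok[e] = {t: (counts[t] + alpha) / (total + alpha * len(vocab)) for t in vocab}
    return prior, tok

def assign(mentions, theta, entities):
    """Assignment step: each mention takes its highest-scoring entity."""
    prior, tok = theta
    def score(m, e):
        return math.log(prior[e]) + sum(math.log(tok[e][t]) for t in m)
    return [max(entities, key=lambda e: score(m, e)) for m in mentions]

def hard_em(mentions, F0, entities, max_iter=20):
    """Alternate the two steps until the assignment stops changing."""
    vocab = sorted({t for m in mentions for t in m})
    F = list(F0)
    for _ in range(max_iter):
        theta = estimate(mentions, F, entities, vocab)
        F_new = assign(mentions, theta, entities)
        if F_new == F:
            break
        F = F_new
    return F, theta

# Mentions as token tuples (made-up data).
mentions = [("chen", "li"), ("c", "li"), ("chris", "lee"), ("c", "lee")]
F, theta = hard_em(mentions, ["e1", "e1", "e2", "e2"], ["e1", "e2"])
print(F)
```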
Incorporating Constraints
• Extend the step that assigns mentions
$$F_t = \arg\max_{F} P(D, F \mid \theta_t)$$
– Basic mention assignment: assign each mention m to
$$\arg\max_{e} P(e \mid m), \qquad P(e \mid m) = \frac{P(m \mid e)\, P(e)}{P(m)}$$
– Extension: Use constraints to improve mention assignments
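The basic mention assignment is a direct application of Bayes' rule, which a few lines can illustrate. The probabilities in `prior` and `likelihood` below are made-up numbers, and the function names are hypothetical.

```python
def posterior(mention, prior, likelihood):
    """P(e | m) = P(m | e) P(e) / P(m), with P(m) obtained by summing over e."""
    joint = {e: likelihood[e].get(mention, 0.0) * prior[e] for e in prior}
    p_m = sum(joint.values())
    return {e: j / p_m for e, j in joint.items()}

def assign_mention(mention, prior, likelihood):
    """arg max over e of P(e | m); P(m) is constant in e, so it can be ignored."""
    return max(prior, key=lambda e: likelihood[e].get(mention, 0.0) * prior[e])

prior = {"e1": 0.6, "e2": 0.4}                     # P(e)
likelihood = {                                      # P(m | e)
    "e1": {"Chen Li": 0.7, "C. Li": 0.3},
    "e2": {"Chris Li": 0.5, "C. Li": 0.5},
}
print(assign_mention("C. Li", prior, likelihood))   # e2
```

Here “C. Li” goes to e2 because 0.5 × 0.4 = 0.20 exceeds 0.3 × 0.6 = 0.18; the margin is small, which is where constraints can change the outcome.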
11
Enforcing Constraints on Clusters
• Apply constraints at each iteration
Compute parameters → Assign mentions → Apply constraints
• Use relaxation labeling to apply constraints to mention assignments
12
Relaxation Labeling
• Start with an initial labeling of mentions with entities
• Iteratively improve mention labels, given constraints
Chen Li = e1; Chris Lee = e2
C. Li = e2; Jane Smith = e4
Y. Lee = e3
C. Lee = e2
Smith, J = e4
• Can be extended to probabilistic constraints
• Scalable
13
Relaxation Labeling
• Start with an initial labeling of mentions with entities
• Iteratively improve mention labels, given constraints
Chen Li = e1; Chris Lee = e2
C. Li = e2 → e1; Jane Smith = e4
Y. Lee = e3
C. Lee = e2
Smith, J = e4
• Can be extended to probabilistic constraints
• Scalable
14
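The relabeling of “C. Li” from e2 to e1 on the preceding slide can be reproduced with a small relaxation-labeling sketch. The update rule here is a simplified product-form rule chosen for brevity and may differ from the one used in the talk; the initial probabilities and the 0.8 / 0.2 compatibility weights are assumptions.

```python
def relax(init, neighbors, compat, n_iter=5):
    """init: mention -> {entity: probability}; neighbors: mention -> [mentions];
    compat(e, e_n): weight for a mention labeled e next to a neighbor labeled e_n."""
    P = {m: dict(dist) for m, dist in init.items()}
    for _ in range(n_iter):
        new_P = {}
        for m, dist in P.items():
            scores = {}
            for e, p in dist.items():
                support = 1.0
                for n in neighbors.get(m, []):
                    # Expected compatibility with neighbor n's current labels.
                    support *= sum(q * compat(e, e_n) for e_n, q in P[n].items())
                scores[e] = p * support
            z = sum(scores.values())
            new_P[m] = {e: s / z for e, s in scores.items()} if z > 0 else dict(dist)
        P = new_P  # synchronous update of all mentions
    return P

def labels(P):
    return {m: max(dist, key=dist.get) for m, dist in P.items()}

init = {
    "Chen Li": {"e1": 0.9, "e2": 0.1},
    "C. Li": {"e1": 0.45, "e2": 0.55},   # initially labeled e2
}
neighbors = {"Chen Li": ["C. Li"], "C. Li": ["Chen Li"]}  # same document
layout = lambda e, e_n: 0.8 if e == e_n else 0.2

print(labels(init))                            # {'Chen Li': 'e1', 'C. Li': 'e2'}
print(labels(relax(init, neighbors, layout)))  # {'Chen Li': 'e1', 'C. Li': 'e1'}
```

Each pass only touches a mention and its neighbors, which is one reason the approach scales.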
Handling Probabilistic Constraints
• Relaxation labeling can combine multiple probabilistic constraints
$$
\begin{aligned}
P(m = e) &= \sum_{O_m} P(m = e, O_m) \\
&= \sum_{O_m} P(m = e \mid O_m)\, P(O_m) \\
&= \sum_{O_m} P(m = e \mid f_1, \ldots, f_n)\, P(O_m) \\
&\propto \sum_{O_m} \sigma\!\left( \sum_{k=1}^{n} \alpha_k f_k(O_m, m, e) \right) \times \prod_{(m_i = e_i) \in O_m} P(m_i = e_i)
\end{aligned}
$$
15
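The last line of the formula on the preceding slide can be evaluated by brute force for a tiny case. In this sketch O_m is taken to be a joint assignment of the other mentions, σ is the logistic function, and the single feature `f_layout` with weight 2.0 is made up; all of these are assumptions for illustration.

```python
import itertools
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def combine_constraints(m, candidates, others, features, alphas):
    """Score each candidate e by summing, over assignments O_m of the other
    mentions, sigma(sum_k alpha_k f_k(O_m, m, e)) times the product of
    P(m_i = e_i); then normalize over the candidates."""
    names = sorted(others)
    scores = {}
    for e in candidates:
        total = 0.0
        for combo in itertools.product(*(sorted(others[n]) for n in names)):
            O = dict(zip(names, combo))
            p_O = math.prod(others[n][O[n]] for n in names)
            activation = sum(a * f(O, m, e) for a, f in zip(alphas, features))
            total += sigmoid(activation) * p_O
        scores[e] = total
    z = sum(scores.values())
    return {e: s / z for e, s in scores.items()}

# Hypothetical layout feature: fires when m gets the same entity as m1.
def f_layout(O, m, e):
    return 1.0 if O["m1"] == e else 0.0

others = {"m1": {"e1": 0.9, "e2": 0.1}}   # current beliefs about m1
result = combine_constraints("m2", ["e1", "e2"], others, [f_layout], [2.0])
print(result)
```

Exhaustive enumeration is exponential in the number of other mentions, so this only works as a worked example on a handful of mentions.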
Pairwise Layer
• So far, we have applied constraints to clusters
Compute parameters → Assign mentions → Apply constraints
• It may be unclear how to enforce constraints on clusters
Cluster: Li, Chen; Chen Li; C. Li; Li, C.
Constraint: C. Li ≠ Li, C.
Remove C. Li or Li, C.?
• Add a pairwise layer
– Convert clusters into predicted matching pairs
– Remove only pairs that negative pairwise hard constraints apply to
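A minimal sketch of the pairwise layer, using the cluster and constraint from this slide: expand each cluster into its mention pairs, then drop only the pairs a negative hard constraint forbids. The function names are hypothetical.

```python
from itertools import combinations

def clusters_to_pairs(clusters):
    """Every unordered pair of mentions in the same cluster is a predicted match."""
    return {frozenset(pair) for cluster in clusters for pair in combinations(cluster, 2)}

def apply_negative_constraints(pairs, cannot_match):
    """Remove only the pairs that a negative pairwise hard constraint applies to."""
    return {pair for pair in pairs if pair not in cannot_match}

clusters = [["Li, Chen", "Chen Li", "C. Li", "Li, C."]]
cannot_match = {frozenset(("C. Li", "Li, C."))}

pairs = apply_negative_constraints(clusters_to_pairs(clusters), cannot_match)
print(len(pairs))  # 5 of the 6 pairs survive
```

Neither “C. Li” nor “Li, C.” has to leave the cluster; both keep their matches with “Li, Chen” and “Chen Li”.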
16
Empirical Evaluation
• Two real-world domains
– Researchers, IMDB
• For each domain
– Collected documents
  – Researchers: homepages from DBLP and the web
  – IMDB: text and structured records from IMDB
– Marked up mentions and their attributes
  – 4,991 researcher mentions
  – 3,889 movie titles from IMDB
– Manually identified all correct matching pairs
• Evaluation Metric:
Precision = # true positives / # predicted pairs
Recall = # true positives / # correct pairs
F1 = (2 * P * R) / (P + R)
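The metric above can be computed directly over sets of matching pairs. This sketch assumes pairs are represented as `frozenset`s of mention ids; the ids are made up.

```python
def pairwise_prf(predicted, correct):
    """Precision, recall, and F1 over sets of predicted and correct pairs."""
    true_positives = len(predicted & correct)
    p = true_positives / len(predicted) if predicted else 0.0
    r = true_positives / len(correct) if correct else 0.0
    f1 = (2 * p * r) / (p + r) if p + r > 0 else 0.0
    return p, r, f1

predicted = {frozenset(("m1", "m2")), frozenset(("m1", "m3")), frozenset(("m2", "m3"))}
correct = {frozenset(("m1", "m2")), frozenset(("m4", "m5"))}

p, r, f1 = pairwise_prf(predicted, correct)
print(p, r, f1)  # precision 1/3, recall 1/2, F1 0.4 (up to floating-point rounding)
```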
17
Using Constraints Improves Accuracy
• Relaxation labeler improves F-1 by 3-12%
F1 (P / R) | Researchers | Movies
Baseline | .66 (.67/.65) | .69 (.61/.79)
Baseline + Relax | .78 (.78/.78) | .72 (.63/.83)
Baseline + Relax + Pairwise | .79 (.80/.79) | .73 (.64/.83)
• Relaxation labeling is very fast
18
Using Constraints Individually
• Each constraint makes a contribution
Researchers | F1 (P / R)
Baseline | .66 (.67/.65)
+ Rare Value | .66 (.67/.66)
+ Subsumption | .67 (.68/.65)
+ Neighborhood | .70 (.68/.72)
+ Individual | .70 (.77/.64)
+ Layout | .71 (.68/.74)
Movies | F1 (P / R)
Baseline | .69 (.61/.79)
+ Incompatible | .70 (.62/.79)
+ Neighborhood | .70 (.62/.81)
+ Individual | .71 (.62/.82)
19
Related Work
• Much work in entity matching
– Cohen, Ravikumar, & Fienberg 2003
– Li, Morie, & Roth 2004
– Bhattacharya & Getoor 2004
– McCallum, Nigam, & Ungar 2000
– Pasula et al. 2003
– Wellner et al. 2004
• Recent work has looked at exploiting semantic constraints
– Personal Information Management (Dong et al. 2004)
– Profiler-based entity matching (Doan et al. 2003)
• Semantic constraints successfully exploited in other applications
– Clustering algorithms (Bilenko et al. 2004), ontology matching (Doan et al. 2002)
20
Summary and Future Work
• Exploit semantic constraints in entity matching
– Models constraints in a uniform probabilistic manner
– Uses a generative model and relaxation labeling to handle constraints in a scalable way
– Experimental results on two real-world domains show effectiveness
• Future work: Learning constraints effectively from current or external data
21
