State of the Art
Numerous solutions in the AI, Database, and Web
communities
– Cohen, Ravikumar, & Fienberg 2003
– Li, Morie, & Roth 2004
– Bhattacharya & Getoor 2004
– McCallum, Nigam, & Ungar 2000
– Pasula et al. 2003
– Wellner et al. 2004
Semantic Constraints
[Figure: example mentions linked by three constraint types (Incompatible, Subsumption, Layout)]
– Chris Li’s Homepage: “Numerical Analysis”, SIAM 2001, with J. Smith.
– Chen Li’s Homepage: Chen Li, Doug Chan. “Ensemble Learning”. ICML 2003
– C. Li. “Data Mining”. KDD 2000
Numerous Semantic Constraint Types
Type          Example
Aggregate     No researcher has chaired more than 3 conferences in a year
Subsumption   If a citation X from DBLP matches a citation Y in a homepage, then each author in Y matches some author in X
Neighborhood  If authors X and Y share similar names and some coauthors, they are likely to match
Incompatible  No researcher exists who has published in both HCI and numerical analysis
Layout        If two mentions in the same document share similar names, they are likely to match
Uniqueness    Mentions in the PC listing of a conference refer to different researchers
Ordering      If two citations match, then their authors will be matched in order
Individual    The researcher named “Mayssam Saria” has fewer than five mentions in DBLP (e.g. being a new graduate student with fewer than five papers)
Our Contributions
Develop a solution to exploit semantic constraints
– Models constraints in a uniform probabilistic manner
– Clusters mentions using a generative model
– Uses relaxation labeling to handle constraints
– Adds a pairwise layer to further improve accuracy
Probabilistic Modeling of Constraints
Modeled as the effect on the probability that a mention
refers to a real-world entity
“If two mentions in the same document share similar names, they are likely to match”:
m1: Chen Li → e1
m2: C. Li
P(m2 = e1 | m1 = e1) = 0.8
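A minimal sketch of how the layout constraint above could be written as a conditional probability. The `Mention` class, the `similar_names` test, and the document ids are assumptions made for illustration; only the value 0.8 and the mention names come from the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mention:
    name: str
    doc: str  # id of the document the mention appears in (hypothetical field)


def similar_names(a: str, b: str) -> bool:
    # Toy similarity test (an assumption): same last name, same first initial.
    a_parts, b_parts = a.split(), b.split()
    return a_parts[-1] == b_parts[-1] and a_parts[0][0] == b_parts[0][0]


def layout_constraint(m1: Mention, m2: Mention, strength: float = 0.8):
    """Return P(m2 = e | m1 = e) when the constraint applies, else None."""
    if m1.doc == m2.doc and similar_names(m1.name, m2.name):
        return strength
    return None


# Document ids are assumed here: m1 and m2 share a document, m3 does not.
m1 = Mention("Chen Li", "d1")
m2 = Mention("C. Li", "d1")
m3 = Mention("Chris Lee", "d2")

print(layout_constraint(m1, m2))  # 0.8
print(layout_constraint(m1, m3))  # None
```

Returning `None` when the constraint does not apply keeps "no opinion" distinct from a probability of zero.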
The Entity Matching Problem
Documents d1, d2 with mentions:
m1: Chen Li    m2: C. Li    m3: Chris Lee
Matching Pairs: m1 = m2
Solution
– Model document generation
– Cluster mentions using this model
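To make the output of the problem concrete, the sketch below derives matching pairs from a clustering of mentions. The dictionary layout and the function name `matching_pairs` are assumptions for illustration, not part of the presented system.

```python
from itertools import combinations


def matching_pairs(assignment):
    """Turn a mention -> entity assignment into the set of matching pairs."""
    by_entity = {}
    for mention, entity in assignment.items():
        by_entity.setdefault(entity, []).append(mention)
    pairs = set()
    for mentions in by_entity.values():
        # Every two mentions in the same cluster form one matching pair.
        for a, b in combinations(sorted(mentions), 2):
            pairs.add((a, b))
    return pairs


# Clustering for the example: m1 and m2 refer to e1, m3 refers to e2.
assignment = {"m1": "e1", "m2": "e1", "m3": "e2"}
print(matching_pairs(assignment))  # {('m1', 'm2')}
```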
Modeling Document Generation
Generate mentions for each document
– Select entities
– Generate and “sprinkle” mentions
Check constraints for each mention
– Decide whether to enforce constraint c
– If enforced, check if mention violates c
– If yes, discard documents and repeat process
[Figure: labels E, θ; entities e1, e2, e2 generating mentions Chen Li, Chris Lee, Chris Lee; m1: Chen Li]
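The generation steps above can be read as rejection sampling. The sketch below is a simplified illustration under stated assumptions: entities and name variants are chosen uniformly, a single document is discarded and regenerated on a violation, and the names `generate_document` and `ambiguous_name` and the example constraint are hypothetical.

```python
import random


def generate_document(entities, constraints, rng, n_mentions=3, max_tries=1000):
    """entities: entity id -> list of name variants.
    constraints: list of (enforce_prob, violates), violates(mention, doc) -> bool.
    """
    entity_ids = sorted(entities)
    for _ in range(max_tries):
        # Select entities, then generate one mention (a name variant) for each.
        chosen = [rng.choice(entity_ids) for _ in range(n_mentions)]
        doc = [(rng.choice(entities[e]), e) for e in chosen]
        # For each mention and constraint: decide whether to enforce, then check.
        violated = any(
            rng.random() < enforce_prob and violates(mention, doc)
            for mention in doc
            for enforce_prob, violates in constraints
        )
        if not violated:
            return doc
        # Otherwise the document is discarded and the process repeats.
    raise RuntimeError("no constraint-satisfying document found")


def ambiguous_name(mention, doc):
    # Example constraint (an assumption): the same name string must not
    # refer to two different entities within one document.
    name, entity = mention
    return any(n == name and e != entity for n, e in doc)


entities = {"e1": ["Chen Li", "C. Li"], "e2": ["Chris Li", "C. Li"]}
rng = random.Random(0)
doc = generate_document(entities, [(1.0, ambiguous_name)], rng)
print(doc)
```

An enforcement probability of 1.0 makes the constraint hard; lower values let some violating documents through, which is one way to model a soft constraint.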
Incorporating Constraints
Extend the step that assigns mentions
$F_t = \arg\max_F P(D, F \mid \theta_t)$
Enforcing Constraints on Clusters
Apply constraints at each iteration
Use relaxation labeling to apply constraints to mention assignments
Relaxation Labeling
Start with an initial labeling of mentions with entities
Iteratively improve mention labels, given constraints
Initial labeling:
Chen Li = e1; C. Li = e2; Y. Lee = e3; C. Lee = e2; Smith, J = e4; Chris Lee = e2; Jane Smith = e4
After one improvement step:
C. Li = e2 → e1 (all other labels unchanged)
Handling Probabilistic Constraints
Relaxation labeling can combine multiple
probabilistic constraints
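\begin{align*}
P(m = e) &= \sum_{O_m} P(m = e, O_m) \\
&= \sum_{O_m} P(m = e \mid O_m)\, P(O_m) \\
&= \sum_{O_m} P(m = e \mid f_1, \ldots, f_n)\, P(O_m) \\
&\propto \sum_{O_m} \sigma\!\left( \sum_{k=1}^{n} \alpha_k f_k(O_m, m, e) \right) \times \prod_{(m_i = e_i) \in O_m} P(m_i = e_i)
\end{align*}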
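A minimal sketch of the final line of the derivation, applied to a single mention. It assumes that O_m ranges over all label combinations of the mention's neighbors and that σ is the logistic function; the feature `same_label_as_neighbor`, the weight 2.0, and the starting probabilities are made up for illustration.

```python
import math
from itertools import product


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def update_mention(m, probs, neighbors, features, alphas):
    """Return the updated distribution P(m = e) for one mention m.

    probs: mention -> {entity: probability}
    neighbors: mention -> list of mentions whose labels form O_m
    features: functions f_k(o_m, m, e); alphas: their weights
    """
    nbrs = neighbors[m]
    scores = {}
    for e in probs[m]:
        total = 0.0
        # Enumerate every labeling O_m of the neighbors of m.
        for labels in product(*(probs[n] for n in nbrs)):
            o_m = dict(zip(nbrs, labels))
            # Product of P(m_i = e_i) over the assignments in O_m.
            weight = math.prod(probs[n][o_m[n]] for n in nbrs)
            support = sigmoid(sum(a * f(o_m, m, e) for a, f in zip(alphas, features)))
            total += support * weight
        scores[e] = total
    z = sum(scores.values())  # normalize, since the formula is a proportionality
    return {e: s / z for e, s in scores.items()}


def same_label_as_neighbor(o_m, m, e):
    # Toy layout-style feature: +1 if some neighbor carries label e, else -1.
    return 1.0 if e in o_m.values() else -1.0


probs = {
    "Chen Li": {"e1": 0.9, "e2": 0.1},
    "C. Li": {"e1": 0.4, "e2": 0.6},
}
neighbors = {"C. Li": ["Chen Li"]}
updated = update_mention("C. Li", probs, neighbors, [same_label_as_neighbor], [2.0])
print(updated)  # "C. Li" now favors e1
```

Enumerating every O_m is exponential in the number of neighbors, so this only suits small neighborhoods; a full labeler would repeat such updates over all mentions until the labels stop changing.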
Pairwise Layer
So far, we have applied constraints to clusters
[Figure: pipeline: Compute parameters → Assign mentions → Apply constraints]
Evaluation Metric:
Precision = # true positives / # predicted pairs
Recall = # true positives / # correct pairs
F1 = (2 * P * R) / (P + R)
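The metric can be computed over unordered mention pairs as in the sketch below. The function name `pairwise_scores` and the example pairs are illustrative assumptions.

```python
def pairwise_scores(predicted, correct):
    """Precision, recall, and F1 over unordered mention pairs."""
    predicted = {frozenset(p) for p in predicted}
    correct = {frozenset(p) for p in correct}
    true_positives = len(predicted & correct)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(correct) if correct else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = (2 * precision * recall) / (precision + recall)
    return precision, recall, f1


predicted = [("m1", "m2"), ("m2", "m3")]
correct = [("m2", "m1")]  # pair order does not matter
p, r, f1 = pairwise_scores(predicted, correct)
print(p, r, round(f1, 3))  # 0.5 1.0 0.667
```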
Using Constraints Improves Accuracy
Relaxation labeler improves F1 by 3-12 percentage points
F1 (P / R)                    Researchers      Movies
Baseline                      .66 (.67/.65)    .69 (.61/.79)
Baseline + Relax              .78 (.78/.78)    .72 (.63/.83)
Baseline + Relax + Pairwise   .79 (.80/.79)    .73 (.64/.83)
Using Constraints Individually
Each constraint makes a contribution
Researchers      F1 (P / R)
Baseline         .66 (.67/.65)
+ Rare Value     .66 (.67/.66)
+ Subsumption    .67 (.68/.65)
+ Neighborhood   .70 (.68/.72)
+ Individual     .70 (.77/.64)
+ Layout         .71 (.68/.74)

Movies           F1 (P / R)
Baseline         .69 (.61/.79)
+ Incompatible   .70 (.62/.79)
+ Neighborhood   .70 (.62/.81)
+ Individual     .71 (.62/.82)
Related Work
Much work in entity matching
– Cohen, Ravikumar, & Fienberg 2003
– Li, Morie, & Roth 2004
– Bhattacharya & Getoor 2004
– McCallum, Nigam, & Ungar 2000
– Pasula et al. 2003
– Wellner et al. 2004
Recent work has looked at exploiting semantic constraints
– Personal Information Management (Dong et al. 2004)
– Profiler-based entity matching (Doan et al. 2003)
Semantic constraints successfully exploited in other applications
– Clustering algorithms (Bilenko et al. 2004), ontology matching (Doan et al. 2002)
Summary and Future Work
Exploit semantic constraints in entity matching
– Models constraints in a uniform probabilistic manner
– Uses a generative model and relaxation labeling to
handle constraints in a scalable way
– Experimental results on two real-world domains show
effectiveness