
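Problem Definition

Given a string dataset R, a query string s, and a threshold τ, find every string r in R with ED(r, s) ≤ τ.

Example: query string s = “yotubecom” and τ = 2.
ED(s, r4) ≤ 2 for the string r4 in R, so r4 is output as a result.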
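As a baseline for the problem statement, here is a minimal brute-force sketch that verifies every string with no filtering. The dataset `R` below is hypothetical (the strings on the original slide are not recoverable), and `edit_distance` and `similarity_search` are illustrative names.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the classic two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def similarity_search(dataset, s, tau):
    """Return every r in dataset with ED(r, s) <= tau, checking all strings."""
    return [r for r in dataset if edit_distance(r, s) <= tau]


# Hypothetical dataset, for illustration only.
R = ["youtube.com", "youtubecom", "yahoo.com", "outlook.com"]
print(similarity_search(R, "yotubecom", 2))
```

Every string is compared against the query here, which is the cost the filter-and-verification approach aims to avoid.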
Applications
• Spell Checking
• Copy Detection
• Entity Linking
• Bioinformatics
• …
Filter-and-Verification Framework
Inputs: query string s, threshold τ, dataset R (accessed through an index)
Filter: is Signature(s) ∩ Signature(r) = ϕ?
  No: r goes on to Verify
Verify: is ED(r, s) ≤ τ?
  Yes: r is added to Results
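The control flow of the framework can be sketched as follows. This is only an illustration: the slides do not say which signature is indexed at this point, so the sketch uses a simple length-based signature as a stand-in (if ED(r, s) ≤ τ, the lengths of r and s differ by at most τ). The names `build_index` and `search` are hypothetical.

```python
from collections import defaultdict


def edit_distance(a, b):
    """Levenshtein distance via a two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def build_index(dataset, tau):
    """Inverted index from signature element to string ids.

    Stand-in signature (an assumption, not from the slides):
    every length within tau of |r|.
    """
    index = defaultdict(list)
    for rid, r in enumerate(dataset):
        for length in range(max(0, len(r) - tau), len(r) + tau + 1):
            index[length].append(rid)
    return index


def search(dataset, index, s, tau):
    """Filter through the index, then verify the surviving candidates."""
    candidates = index.get(len(s), [])           # Filter step
    return [dataset[rid] for rid in candidates   # Verify step
            if edit_distance(dataset[rid], s) <= tau]


# Hypothetical dataset, for illustration only.
R = ["youtubecom", "youtube.com", "yahoo.com", "wikipedia.org"]
idx = build_index(R, tau=2)
print(search(R, idx, "yotubecom", tau=2))
```

The index is built for one fixed τ, so queries in this sketch must use the same threshold.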
Filter-and-Verification Framework
Inputs: query string s, threshold τ, dataset R (accessed through an index)
Filter: is Signature(s) ∩ Signature(r) = ϕ?
  No: r goes on to Verify
Verify: does r pass the alignment filter? If yes, is ED(r, s) ≤ τ?
  Yes: r is added to Results

Complexity Improvement:
Improved from O(min(|r|, |s|) · τ) to O(qτ²)
Alignment Filter

Intuition of Alignment Filter:

Let g_1, …, g_{τ+1} be τ+1 non-overlapping q-grams of s. Suppose that, in the best case, we need err_i edit operations to transform g_i into a substring of r. Then ED(r, s) ≥ Σ_{i=1}^{τ+1} err_i.

If Σ_{i=1}^{τ+1} err_i > τ, then ED(r, s) > τ.
Alignment Filter
Substring edit distance (sed)
sed(g_i, r) is the minimum edit distance between g_i and any substring of r.

Alignment filter:
If Σ_{i=1}^{τ+1} sed(g_i, r) > τ, then ED(r, s) > τ.
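A minimal sketch of sed and the alignment filter follows. It assumes the g_i are non-overlapping q-grams of s, picked here at positions 0, q, 2q, … purely for illustration; the slides do not show how the g_i are selected, so `disjoint_qgrams` is a hypothetical stand-in.

```python
def sed(g, r):
    """Minimum edit distance between g and any substring of r.

    Same dynamic program as edit distance, except that the first row is
    all zeros (a match may start anywhere in r) and the answer is the
    minimum of the last row (a match may end anywhere in r).
    """
    prev = [0] * (len(r) + 1)
    for i, cg in enumerate(g, start=1):
        cur = [i]
        for j, cr in enumerate(r, start=1):
            cur.append(min(prev[j] + 1,                # delete cg
                           cur[j - 1] + 1,             # insert cr
                           prev[j - 1] + (cg != cr)))  # substitute or match
        prev = cur
    return min(prev)


def disjoint_qgrams(s, q, tau):
    """Up to tau+1 non-overlapping q-grams of s (hypothetical selection)."""
    grams = [s[p:p + q] for p in range(0, len(s) - q + 1, q)]
    return grams[: tau + 1]


def alignment_filter_prunes(r, s, q, tau):
    """True if the sum of sed values already exceeds tau, so ED(r, s) > tau."""
    return sum(sed(g, r) for g in disjoint_qgrams(s, q, tau)) > tau


print(alignment_filter_prunes("youtubecom", "yotubecom", q=3, tau=2))
print(alignment_filter_prunes("yahoo.com", "yotubecom", q=3, tau=2))
```

Because the chosen q-grams do not overlap, the edit operations counted for one of them cannot be reused for another, which is why their sed values can be summed into a lower bound on ED(r, s).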
Alignment Filter

Accelerating Calculation:
• Computing sed(g_i, r) takes O(q|r|) time.
• By the position filter, g_i can only align to a substring x_i of r, where |x_i| ≤ 2τ + q.
• Thus, if Σ_{i=1}^{τ+1} sed(g_i, x_i) > τ, then ED(r, s) > τ.
• This reduces the cost of each sed computation to O(qτ).
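The windowed computation can be sketched as below. The assumptions are labeled in the code: the g_i are taken at positions 0, q, 2q, … of s (a hypothetical choice), and the window x_i for a q-gram starting at position p is r[p − τ : p + q + τ], clipped to the bounds of r.

```python
def sed(g, x):
    """Minimum edit distance between g and any substring of x."""
    prev = [0] * (len(x) + 1)      # a match may start anywhere in x
    for i, cg in enumerate(g, start=1):
        cur = [i]
        for j, cx in enumerate(x, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (cg != cx)))
        prev = cur
    return min(prev)               # a match may end anywhere in x


def windowed_alignment_sum(r, s, q, tau):
    """Sum of sed(g_i, x_i) over up to tau+1 non-overlapping q-grams of s."""
    total = 0
    for k in range(tau + 1):
        p = k * q                  # start of g_i in s (hypothetical choice)
        g = s[p:p + q]
        if len(g) < q:             # s has no further complete q-gram
            break
        x = r[max(0, p - tau): p + q + tau]   # window of length <= q + 2*tau
        total += sed(g, x)
    return total


def windowed_alignment_prunes(r, s, q, tau):
    return windowed_alignment_sum(r, s, q, tau) > tau


print(windowed_alignment_sum("youtubecom", "yotubecom", q=3, tau=2))
print(windowed_alignment_sum("comubeyot", "yotubecom", q=3, tau=2))
```

The second call shows why the window also tightens the bound: every q-gram of “yotubecom” chosen here occurs somewhere in “comubeyot”, so the unwindowed sum would be 0, but none of them occurs near its original position.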

Evaluating Alignment Filter
Average Search Time

NoFilter: without any filter


ContentFilter: From EDJoin
AlignFilter: Alignment Filter
Evaluating Alignment Filter
Candidate Number

NoFilter: without any filter


ContentFilter: From EDJoin
AlignFilter: Alignment Filter
Real: Number of results
Preliminary: Prefix Filter
Sort all q-grams by a global ordering, such as idf.

q(r): the sorted q-gram set of string r
q(s): the sorted q-gram set of string s
Pre(•) is the prefix of q(•), with |Pre(•)| = qτ + 1.

Example from the slide diagram:
q(r) = g1 g2 g5 g6 g9 g10 g11
q(s) = g3 g4 g7 g8 g11 g12 g13
Diagram annotation (repeated under each q-gram): > g10

Prefix Filter: If Pre(r) ∩ Pre(s) = ϕ, then ED(r, s) > τ.
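A minimal sketch of the prefix filter follows. The global ordering used here is document frequency within a hypothetical dataset (rare q-grams first, ties broken alphabetically) as a stand-in for idf, and all function names are illustrative. The sketch also declines to prune when both strings have fewer than qτ + 1 distinct q-grams, since the guarantee does not apply in that case.

```python
from collections import Counter


def qgrams(s, q):
    """Set of distinct q-grams of s."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def make_global_order(dataset, q):
    """Sort key that puts rare q-grams first (an idf-like ordering)."""
    df = Counter(g for r in dataset for g in qgrams(r, q))
    # Unseen q-grams count as frequency 0; ties are broken by the q-gram text.
    return lambda g: (df[g], g)


def prefix(s, q, tau, key):
    """First q*tau + 1 q-grams of s under the global ordering."""
    return sorted(qgrams(s, q), key=key)[: q * tau + 1]


def prefix_filter_prunes(r, s, q, tau, key):
    """True if the prefixes are disjoint, which implies ED(r, s) > tau."""
    pre_r, pre_s = prefix(r, q, tau, key), prefix(s, q, tau, key)
    need = q * tau + 1
    if len(pre_r) < need and len(pre_s) < need:
        return False  # both strings too short: pruning would be unsafe
    return set(pre_r).isdisjoint(pre_s)


# Hypothetical dataset, for illustration only.
R = ["youtubecom", "wikipedia", "facebookcom"]
order = make_global_order(R, q=2)
print(prefix_filter_prunes("wikipedia", "yotubecom", 2, 1, order))
print(prefix_filter_prunes("youtubecom", "yotubecom", 2, 1, order))
```

The same ordering must be applied to both strings; putting rare q-grams first tends to keep the prefixes selective, so fewer unrelated strings survive the filter.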


Alignment Filter
Non-consecutive errors:

youtubecom
yoytupecxm
With q = 3, the 3 non-consecutive errors destroy 8 q-grams.

Consecutive errors:
youtubecom
youtzpxcom
With q = 3, the 3 consecutive errors destroy only 5 q-grams.
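The two counts can be checked with a short sketch. It assumes substitution errors only, so both strings have the same length and q-grams can be compared position by position; `destroyed_qgrams` is an illustrative name.

```python
def destroyed_qgrams(original, corrupted, q):
    """Count q-grams of `original` that differ at the same position in `corrupted`.

    Assumes substitution errors only, so both strings have equal length.
    """
    if len(original) != len(corrupted):
        raise ValueError("strings must have equal length")
    return sum(original[i:i + q] != corrupted[i:i + q]
               for i in range(len(original) - q + 1))


print(destroyed_qgrams("youtubecom", "yoytupecxm", 3))  # non-consecutive errors
print(destroyed_qgrams("youtubecom", "youtzpxcom", 3))  # consecutive errors
```

Spread-out errors each hit a separate group of q-grams, while adjacent errors share most of the q-grams they touch, which is the gap the alignment filter exploits.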
