random members $g_1, \ldots, g_L$ of $\mathcal{G}$, where $\rho = \frac{\log 1/p_1}{\log 1/p_2}$, and hash all the data points as well as
the query point using all $g_i$'s ($1 \le i \le L$). Then, to find an approximate near neighbor,
we retrieve at most $3L$ data points from the buckets $g_j(z)$ ($1 \le j \le L$), and report the
closest one as a $(c, \lambda)$-ANN. We would like to prove that the reported answer is correct, with
constant probability.
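The retrieval procedure described above can be sketched as follows. This is only an illustration under stated assumptions, not the code shipped with the problem set: it assumes binary vectors under Hamming distance with a bit-sampling hash family, and the names `build_tables`, `query`, and `hamming` are hypothetical. It also inspects the first 3L candidates it encounters (counting repeats), which is one possible reading of "at most 3L data points".

```python
import random

def hamming(x, y):
    """Hamming distance between two equal-length bit tuples."""
    return sum(a != b for a, b in zip(x, y))

def build_tables(points, k, L, rng):
    """Build L hash tables; table j keys each point by k sampled coordinates (g_j)."""
    dim = len(points[0])
    tables = []
    for _ in range(L):
        # Assumption: each h_i picks one coordinate, so g_j = (h_1, ..., h_k).
        coords = [rng.randrange(dim) for _ in range(k)]
        buckets = {}
        for idx, p in enumerate(points):
            key = tuple(p[c] for c in coords)
            buckets.setdefault(key, []).append(idx)
        tables.append((coords, buckets))
    return tables

def query(points, tables, z, dist):
    """Inspect at most 3L candidates from the buckets g_j(z); return the closest index or None."""
    limit = 3 * len(tables)
    best, best_d, seen = None, float("inf"), 0
    for coords, buckets in tables:
        key = tuple(z[c] for c in coords)
        for idx in buckets.get(key, []):
            if seen >= limit:
                return best
            seen += 1
            d = dist(points[idx], z)
            if d < best_d:
                best, best_d = idx, d
    return best

# Usage: index 200 random 32-bit points and look up one of them.
rng = random.Random(0)
points = [tuple(rng.randint(0, 1) for _ in range(32)) for _ in range(200)]
tables = build_tables(points, k=8, L=5, rng=rng)
nearest = query(points, tables, points[17], hamming)
```

Capping the number of inspected candidates at 3L is what keeps the query cost bounded; parts (a) and (b) below argue that the cap rarely excludes a true near neighbor.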
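(a)[6pts] Define $W_j = \{x \in \mathcal{A} \mid g_j(x) = g_j(z)\}$ ($1 \le j \le L$), and $T = \{x \in \mathcal{A} \mid d(x, z) > c\lambda\}$.
Prove:
$$\Pr\left[\sum_{j=1}^{L} |T \cap W_j| > 3L\right] < \frac{1}{3}$$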
CS 246: Mining Massive Data Sets - Problem Set 1 5
(b)[4pts] Let $x^* \in \mathcal{A}$ be a point such that $d(x^*, z) \le \lambda$. Prove:
$$\Pr\left[g_j(x^*) \ne g_j(z)\ (1 \le j \le L)\right] < \frac{1}{e}$$
(c)[3pts] Conclude that with a constant probability the reported point is an actual $(c, \lambda)$-ANN.
(d)[12pts] A dataset of images¹, patches.mat, is provided in:
http://www.stanford.edu/class/cs246/cs246-11-mmds/lsh.zip
Each column in this dataset is a 20 × 20 image patch represented as a 400-dimensional vector.
We would like to compare the performance of LSH-based approximate near neighbor search
with that of linear search. You should use the code provided with the dataset for this task.
The included ReadMe.txt file explains how to use the provided code. In particular, you will
need to use the functions lsh and lshlookup. We will use the $L_1$ distance measure, and the
corresponding LSH with L = 10, k = 24. Feel free to use other parameter values, but make
sure you explain the reason behind your choice of parameters. Then, for each of the image
patches in columns 100, 200, 300, . . . , 1000, find the top 3 near neighbors using both LSH
and linear search. What is the average search time for LSH? What about for linear search?
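As a rough illustration of the linear-search baseline and its timing, the sketch below scans every column under the $L_1$ distance. It is not the provided code: the helper names `l1`, `linear_top_k`, and `average_time` are hypothetical, the data are assumed to be a list of equal-length numeric tuples, and excluding the query patch from its own neighbor list is an assumption the problem statement does not spell out. Note that Python indices are 0-based, while the column numbers above are 1-based.

```python
import heapq
import time

def l1(x, y):
    """L1 (Manhattan) distance between two equal-length vectors."""
    return sum(abs(a - b) for a, b in zip(x, y))

def linear_top_k(columns, query_idx, k=3):
    """Indices of the k columns nearest to columns[query_idx], excluding the query itself."""
    z = columns[query_idx]
    candidates = ((l1(x, z), i) for i, x in enumerate(columns) if i != query_idx)
    return [i for _, i in heapq.nsmallest(k, candidates)]

def average_time(search, query_indices):
    """Average wall-clock seconds per call of search(q) over the given query indices."""
    start = time.perf_counter()
    for q in query_indices:
        search(q)
    return (time.perf_counter() - start) / len(query_indices)

# Usage on a toy dataset of five 2-dimensional "patches".
columns = [(0, 0), (1, 0), (5, 5), (0, 2), (10, 10)]
neighbors = linear_top_k(columns, 0)
seconds = average_time(lambda q: linear_top_k(columns, q), [0, 1, 2])
```

The same `average_time` wrapper could be pointed at an LSH lookup so that both methods are timed over identical queries.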
Assuming $\{z_j \mid 1 \le j \le 10\}$ to be the set of image patches considered (i.e., $z_j$ is the image
patch in column $100j$), $\{x_{ij}\}_{i=1}^{3}$ to be the approximate near neighbors of $z_j$ found using LSH,
and $\{x^*_{ij}\}_{i=1}^{3}$ to be the (true) top 3 near neighbors of $z_j$ found using linear search, compute
the following error measure:
$$\text{error} = \frac{1}{10} \sum_{j=1}^{10} \frac{\sum_{i=1}^{3} d(x_{ij}, z_j)}{\sum_{i=1}^{3} d(x^*_{ij}, z_j)}$$
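A minimal sketch of how this error measure might be computed is given below. The function name `lsh_error` and the data layout (one list of neighbor vectors per query, for the LSH results and for the linear-search results) are assumptions made for illustration only.

```python
def l1(x, y):
    """L1 (Manhattan) distance between two equal-length vectors."""
    return sum(abs(a - b) for a, b in zip(x, y))

def lsh_error(queries, approx, exact, dist):
    """Average, over queries, of the ratio between the summed distances to the
    LSH neighbors and the summed distances to the true neighbors."""
    total = 0.0
    for z, lsh_neighbors, true_neighbors in zip(queries, approx, exact):
        numerator = sum(dist(x, z) for x in lsh_neighbors)
        denominator = sum(dist(x, z) for x in true_neighbors)
        total += numerator / denominator
    return total / len(queries)

# Usage with a single toy query and three neighbors from each method.
queries = [(0, 0)]
approx = [[(1, 0), (2, 0), (3, 0)]]   # neighbors returned by LSH (hypothetical)
exact = [[(1, 0), (0, 1), (1, 1)]]    # true neighbors from linear search
error = lsh_error(queries, approx, exact, l1)
```

Because the true neighbors minimize the summed distance, each ratio is at least 1 whenever LSH returns a full set of neighbors, and a value of exactly 1 means LSH matched linear search in total distance.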
Plot the error value as a function of L (for L = 10, 12, 14, . . . , 20, with k = 24). Similarly,
plot the error value as a function of k (for k = 16, 18, 20, 22, 24 with L = 10).
Finally, plot the top 10 near neighbors found using the two methods (using the default
L = 10, k = 24 (or your alternative choice of parameter values) for LSH) for one or more of
the image patches, together with the image patch itself. How do they compare visually?
¹ Dataset and code adopted from Brown University's Greg Shakhnarovich