
Due January 19, 2011, 11:59pm

CS246: Mining Massive Data Sets


Problem Set 1 - Winter 2011
Submit all your homework electronically to your dropbox folder at http://coursework.stanford.edu.
For each homework you need to upload two files to your dropbox folder:
1) Your solution report. You can submit it as a .pdf or .doc. You do not need to typeset the
document; it can be handwritten and scanned. The report should be named:
<SUNetId>_<Last-Name>_<First-Name>_HW1_Ans (.doc or .pdf)
2) Any source code that you use to obtain the results stated in your solution report,
submitted as a single ZIP file. Also mention any tools that you use in your solution(s).
Include all your source code in a file named:
<SUNetId>_<Last-Name>_<First-Name>_HW1_Code.zip
The Problems
Problem 1.1 MapReduce (25 points) [Peyman]
(a) [8pts] Write a MapReduce program in Hadoop that counts all the 4-letter sequences in a
text. The program should ignore all white space, digits, and punctuation, as if they did not
exist, except for the end-of-line and end-of-sentence (.) characters. An end of sentence is a
dot (.) character right after a non-digit character. When the program reaches an end of
sentence or an end of line, it should NOT form 4-letter sequences by concatenating letters
from before and after these characters. The program should also be case-insensitive. As an
example, the output of your code for the input "Hello 4, World. Goodbye." should be as follows:
hell 1
ello 1
llow 1
lowo 1
owor 1
worl 1
orld 1
good 1
oodb 1
odby 1
dbye 1
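The extraction rule in part (a) can be sketched outside Hadoop in plain Python. This is only an illustration of the tokenization logic, not a MapReduce job: the function names `segments` and `count_four_grams` are hypothetical, and treating a dot at the very start of a line as ordinary punctuation is an assumption, since the problem statement does not cover that case.

```python
from collections import Counter

def segments(line):
    """Yield the letter runs of one line, split only at end-of-sentence dots."""
    current, prev = [], ""
    for ch in line:
        if ch.isalpha():
            current.append(ch.lower())
        elif ch == "." and prev and not prev.isdigit():
            # A dot right after a non-digit character ends the sentence.
            yield "".join(current)
            current = []
        # Anything else (spaces, digits, other punctuation) is skipped.
        prev = ch
    yield "".join(current)

def count_four_grams(text):
    """Count 4-letter sequences without crossing line or sentence ends."""
    counts = Counter()
    for line in text.splitlines():
        for seg in segments(line):
            for i in range(len(seg) - 3):
                counts[seg[i:i + 4]] += 1
    return counts

for gram, freq in count_four_grams("Hello 4, World. Goodbye.").items():
    print(gram, freq)
```

In a Hadoop job, the inner loop would presumably sit in the mapper (emitting each 4-letter sequence with a count of 1) and the summation in the reducer.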
Test your code on the file http://www.stanford.edu/class/cs246/cs246-11-mmds/hw_files/hw1q1.txt
and find the 3 most frequent 4-letter sequences. For each of the top 3 sequences,
print the rank, the frequency, and the sequence itself.
(b) [5pts] Suppose an undirected graph is described in a file by its edge set as follows: each
line of the file has the format node1_id#node2_id, which indicates that there is an edge
between node1 and node2. Write the necessary MapReduce pseudo-code(s) to find all the paths
of length 5, given the input in the specified format. You can assume that node IDs
are integers 1, ..., N and that each edge appears once in the input file (only one of
node1_id#node2_id and node2_id#node1_id will be in the input file). The final result should
not include any loops, but it is OK if it has duplicate paths.
(c) [2pts] What is the minimum number of MapReduce jobs required to compute all paths
of length L in part (b)? Justify your answer.
(d) [5pts] Matrix multiplication can be done by dividing each matrix into blocks and then
multiplying corresponding blocks (see http://mathworld.wolfram.com/BlockMatrix.html
and http://en.wikipedia.org/wiki/Block_matrix for details and examples).
Suppose that all the nonzero elements of matrix A are written in a file in the following format:
row index, column index, value
Write the necessary MapReduce pseudo-code(s) to compute $AA^T$ using block multiplication.
You may assume we have m blocks in each column and n blocks in each row.
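As a small, non-distributed illustration of the block idea in part (d), the sketch below computes $AA^T$ one block pair at a time and can be checked against a direct product. The helper names and the `row_cuts`/`col_cuts` boundary lists are assumptions made for this example. It uses dense Python lists rather than the sparse triple format, and it is not the MapReduce pseudo-code the problem asks for.

```python
def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    """Plain dense matrix product of two lists of rows."""
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]

def block(m, r0, r1, c0, c1):
    """Sub-matrix with rows r0..r1-1 and columns c0..c1-1."""
    return [row[c0:c1] for row in m[r0:r1]]

def aat_blockwise(a, row_cuts, col_cuts):
    """Compute A * A^T by accumulating products of block pairs.

    row_cuts and col_cuts are boundary indices, e.g. [0, 2, 3].
    Block (p, q) of the result is the sum over r of A[p][r] * A[q][r]^T.
    """
    n = len(a)
    result = [[0] * n for _ in range(n)]
    for p in range(len(row_cuts) - 1):
        for q in range(len(row_cuts) - 1):
            for r in range(len(col_cuts) - 1):
                a_pr = block(a, row_cuts[p], row_cuts[p + 1], col_cuts[r], col_cuts[r + 1])
                a_qr = block(a, row_cuts[q], row_cuts[q + 1], col_cuts[r], col_cuts[r + 1])
                partial = matmul(a_pr, transpose(a_qr))
                for i, row in enumerate(partial):
                    for j, value in enumerate(row):
                        result[row_cuts[p] + i][row_cuts[q] + j] += value
    return result

a = [[1, 2, 0], [0, 3, 4], [5, 0, 6]]
print(aat_blockwise(a, [0, 2, 3], [0, 1, 3]) == matmul(a, transpose(a)))
```

Each (p, q, r) triple is an independent unit of work, which is what makes the block formulation a natural fit for a keyed map/reduce computation.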
(e) [5pts] What are the trade-offs in choosing m and n? Hint: Discuss how the number
of mappers and reducers changes as we increase/decrease m and n. How many times will each
element of the matrix be copied over the network, depending on m and n? How is the memory
required in the reducers for storing temporary data affected by m and n?
Problem 1.2 Locality Sensitive Hashing (15 points) [Aditya]
An alternative definition for locality sensitive hashing schemes is:
Definition. A locality sensitive hashing scheme is a set $\mathcal{F}$ of hash functions that
operate on a set of objects, such that for two objects x, y,
$$\Pr_{h \in \mathcal{F}}[h(x) = h(y)] = \mathrm{sim}(x, y) \qquad (1)$$
where sim is a similarity function that maps a pair of objects x, y to a single number in
[0, 1].
For this definition, answer the following questions:
(a) [5pts] (Necessary Condition) For a similarity function sim to have a locality sensitive
hashing scheme of the form given above, prove that the function $d(x, y) = 1 - \mathrm{sim}(x, y)$
has to satisfy the triangle inequality. (Hint: The triangle inequality is
$d(x, y) + d(y, z) \ge d(x, z)$ for all x, y, z.)
(b) [5pts] By means of a counterexample, show that there is no locality sensitive hashing
scheme for the Overlap similarity function:
$$\mathrm{sim}_{\mathrm{Over}}(A, B) = \frac{|A \cap B|}{\min(|A|, |B|)},$$
where A, B are two sets.
(c) [5pts] By means of a counterexample, show that there is no locality sensitive hashing
scheme for the Dice similarity function:
$$\mathrm{sim}_{\mathrm{Dice}}(A, B) = \frac{|A \cap B|}{\frac{1}{2}(|A| + |B|)},$$
where A, B are two sets.
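When searching for counterexamples, it can help to evaluate both similarity functions exactly and to test the triangle inequality for $d = 1 - \mathrm{sim}$ on candidate triples of sets. The sketch below is one hedged way to do that: the function names are hypothetical, the sets are assumed to be non-empty, and `Fraction` is used to avoid floating-point rounding.

```python
from fractions import Fraction

def overlap_sim(a, b):
    """Overlap similarity |A n B| / min(|A|, |B|) for non-empty sets."""
    return Fraction(len(a & b), min(len(a), len(b)))

def dice_sim(a, b):
    """Dice similarity |A n B| / (0.5 * (|A| + |B|)) for non-empty sets."""
    return Fraction(2 * len(a & b), len(a) + len(b))

def violates_triangle(sim, x, y, z):
    """True if d = 1 - sim breaks d(x, y) + d(y, z) >= d(x, z)."""
    def d(u, v):
        return 1 - sim(u, v)
    return d(x, y) + d(y, z) < d(x, z)

# Example call on one candidate triple (this triple does not violate the inequality).
print(violates_triangle(dice_sim, {1, 2}, {2, 3}, {3, 4}))
```

A triple for which `violates_triangle` returns `True` is a candidate counterexample, by the necessary condition in part (a).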
Problem 1.3 Association Rules (35 points) [Aditya]
In this problem, we aim to study some experimental properties of the association rule mining
algorithm. We focus on the first step, i.e., finding frequent itemsets.
Download the dataset at http://fimi.cs.helsinki.fi/data/retail.dat. Each line represents
a basket or a transaction. There are exactly 16470 items (0-16469), and 88162
transactions.
Implement the a-priori algorithm discussed in class.
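A minimal in-memory sketch of the a-priori idea is shown below, assuming baskets are given as lists of item IDs and the threshold is an absolute count. The function name `apriori` and its parameters are hypothetical, and the version discussed in class may organize the passes or the candidate generation differently.

```python
from collections import Counter
from itertools import combinations

def apriori(baskets, min_count, max_size=3):
    """Return {itemset (sorted tuple): count} for frequent itemsets up to max_size."""
    baskets = [frozenset(b) for b in baskets]
    # Pass 1: count single items and keep the frequent ones.
    counts = Counter((item,) for b in baskets for item in b)
    current = {s: c for s, c in counts.items() if c >= min_count}
    frequent = dict(current)
    for size in range(2, max_size + 1):
        # Only items that appear in some frequent (size-1)-itemset can matter.
        items = sorted({i for s in current for i in s})
        counts = Counter()
        for b in baskets:
            present = [i for i in items if i in b]
            for cand in combinations(present, size):
                # Monotonicity: every (size-1)-subset must itself be frequent.
                if all(sub in current for sub in combinations(cand, size - 1)):
                    counts[cand] += 1
        current = {s: c for s, c in counts.items() if c >= min_count}
        if not current:
            break
        frequent.update(current)
    return frequent

toy = [[1, 2, 3], [1, 2], [1, 3], [2, 3], [1, 2, 3]]
print(apriori(toy, min_count=3))
```

For a percentage threshold, the absolute count can be derived with integer arithmetic; for example, `2 * 88162 // 100` evaluates to 1763.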
(a) [5pts] Fix the support threshold to 2%. (In other words, an itemset must occur in at least
$\lfloor \frac{2}{100} \times 88162 \rfloor = 1763$ transactions to be frequent.) For the entire
set of transactions, find the number of frequent itemsets of each size 1, 2 and 3.
(b) [10pts] Fix the support threshold to 2%. (In other words, an itemset must occur in at least
$\lfloor \frac{2}{100} \times 88162 \rfloor = 1763$ transactions to be frequent.) For the entire
set of transactions, retrieve all frequent itemsets of sizes 1, 2 and 3. Let this set be S.
Now, for each $i \in \{1, 2, \ldots, 8\}$, repeat the experiment with the first $(11000 \times i)$
transactions and retrieve the frequent itemsets with the same support threshold (i.e., the
itemsets that occur in at least $\lfloor \frac{2}{100} \times 11000 \times i \rfloor$ transactions)
of sizes 1, 2, and 3. Let this set be $S_i$.
Plot precision (y-axis) and recall (y-axis) versus i (x-axis). Precision is defined as
$$\frac{|S \cap S_i|}{|S_i|},$$
while recall is defined as
$$\frac{|S \cap S_i|}{|S|}.$$
Explain any interesting behavior (if present).
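The precision and recall defined in part (b) reduce to two set operations. The sketch below assumes each itemset is represented as a hashable value such as a sorted tuple; the function name is hypothetical, and returning 0.0 for an empty set is a convention chosen here, not something the problem specifies.

```python
def precision_recall(s_full, s_sample):
    """Precision |S n S_i| / |S_i| and recall |S n S_i| / |S|."""
    s_full, s_sample = set(s_full), set(s_sample)
    common = len(s_full & s_sample)
    precision = common / len(s_sample) if s_sample else 0.0
    recall = common / len(s_full) if s_full else 0.0
    return precision, recall

S = {(1,), (2,), (1, 2)}   # frequent itemsets on all transactions (toy values)
S_i = {(1,), (3,)}         # frequent itemsets on a prefix (toy values)
print(precision_recall(S, S_i))
```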
(c) [10pts] Fix the entire set of transactions. For each $i \in \{1, 2, \ldots, 8\}$, retrieve the
frequent itemsets of sizes 1, 2 and 3 with the support threshold set to $(0.5 \times i)\%$. Let
this set be $S_i$.
Plot $|S_i|$ (y-axis) versus i (x-axis).
Explain any interesting behavior (if present).
(d) [10pts] (Relationship between rules) Let's say the set of items is
$S = \{i_1, i_2, i_3, i_4, i_5, i_6, i_7\}$.
Given that $i_1 i_2 \Rightarrow i_3$ and $i_3 i_4 \Rightarrow i_5 i_6$ are two association rules,
while $i_2 i_5 \Rightarrow i_6 i_7$ is not an association rule, for each of the following, write
whether it must be, may be, or cannot be an association rule, and why:
$i_1 \Rightarrow i_3$
$i_1 i_2 i_4 \Rightarrow i_3$
$i_3 \Rightarrow i_5 i_6$
$i_3 i_4 \Rightarrow i_5$
$i_2 i_5 i_7 \Rightarrow i_6$
Problem 1.4 LSH for Approximate Near Neighbor Search (25 points) [Bahman]
In this problem, we study the application of LSH to the problem of finding approximate near
neighbors.
Assume we have a dataset $\mathcal{A}$ of n points. Consider the $(c, \lambda)$-Approximate Near
Neighbor (ANN) problem: Given a query point z, assuming there is a point x in the dataset with
$d(x, z) \le \lambda$, return a point $x'$ from the dataset with $d(x', z) \le c\lambda$. Here
$c > 1$ is the maximum approximation factor allowed in the problem.
Assume the LSH family $\mathcal{H}$ of hash functions is $(\lambda, c\lambda, p_1, p_2)$-sensitive
for the distance measure $d(\cdot, \cdot)$. Let
$\mathcal{G} = \mathcal{H}^k = \{g = (h_1, \ldots, h_k) \mid h_i \in \mathcal{H}\}$, where
$k = \log_{1/p_2} n$. We take $L = n^{\rho}$ random members $g_1, \ldots, g_L$ of $\mathcal{G}$,
where $\rho = \frac{\log 1/p_1}{\log 1/p_2}$, and hash all the data points as well as
the query point using all $g_i$'s ($1 \le i \le L$). Then, to find an approximate near neighbor,
we retrieve at most 3L data points from the buckets $g_j(z)$ ($1 \le j \le L$), and report the
closest one as a $(c, \lambda)$-ANN. We would like to prove that the reported answer is correct,
with constant probability.
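To get a feel for the parameter choices above, the following sketch evaluates $k = \log_{1/p_2} n$, $\rho = \log(1/p_1) / \log(1/p_2)$ and $L = n^{\rho}$ numerically. The function name is hypothetical, and the values are returned unrounded; in an actual implementation k and L would have to be rounded to integers, which is an implementation choice the text does not fix.

```python
import math

def lsh_parameters(n, p1, p2):
    """Return (k, rho, L) as real numbers for n points and collision probabilities p1 > p2."""
    k = math.log(n) / math.log(1 / p2)        # k = log base (1/p2) of n
    rho = math.log(1 / p1) / math.log(1 / p2)
    L = n ** rho
    return k, rho, L

# Toy values chosen only for illustration.
print(lsh_parameters(n=1024, p1=0.5, p2=0.25))
```

With these toy values, k is about 5, rho is about 0.5 and L is about 32, so the number of hash tables grows sublinearly in n whenever p1 > p2.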
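(a) [6pts] Define $W_j = \{x \in \mathcal{A} \mid g_j(x) = g_j(z)\}$ ($1 \le j \le L$), and
$T = \{x \in \mathcal{A} \mid d(x, z) > c\lambda\}$.
Prove:
$$\Pr\left[\sum_{j=1}^{L} |T \cap W_j| > 3L\right] < \frac{1}{3}$$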
(b) [4pts] Let $x^* \in \mathcal{A}$ be a point such that $d(x^*, z) \le \lambda$. Prove:
$$\Pr\left[\forall\, 1 \le j \le L,\; g_j(x^*) \ne g_j(z)\right] < \frac{1}{e}$$
(c) [3pts] Conclude that with constant probability the reported point is an actual
$(c, \lambda)$-ANN.
(d) [12pts] A dataset of images [1], patches.mat, is provided in:
http://www.stanford.edu/class/cs246/cs246-11-mmds/lsh.zip
Each column in this dataset is a 20 x 20 image patch represented as a 400-dimensional vector.
We would like to compare the performance of LSH-based approximate near neighbor search
with that of linear search. You should use the code provided with the dataset for this task.
The included ReadMe.txt file explains how to use the provided code. In particular, you will
need to use the functions lsh and lshlookup. We will use the $L_1$ distance measure, and the
corresponding LSH with L = 10, k = 24. Feel free to use other parameter values, but make
sure you explain the reasoning behind your choice of parameters. Then, for each of the image
patches in columns 100, 200, 300, ..., 1000, find the top 3 near neighbors using both LSH
and linear search. What is the average search time for LSH? What about for linear search?
Assuming $\{z_j \mid 1 \le j \le 10\}$ to be the set of image patches considered (i.e., $z_j$ is
the image patch in column 100j), $\{x_{ij}\}_{i=1}^{3}$ to be the approximate near neighbors of
$z_j$ found using LSH, and $\{x^*_{ij}\}_{i=1}^{3}$ to be the (true) top 3 near neighbors of $z_j$
found using linear search, compute the following error measure:
$$\mathrm{error} = \frac{1}{10} \sum_{j=1}^{10} \frac{\sum_{i=1}^{3} d(x_{ij}, z_j)}{\sum_{i=1}^{3} d(x^*_{ij}, z_j)}$$
Plot the error value as a function of L (for L = 10, 12, 14, . . . , 20, with k = 24). Similarly,
plot the error value as a function of k (for k = 16, 18, 20, 22, 24 with L = 10).
Finally, plot the top 10 near neighbors found using the two methods (using the default
L = 10, k = 24 (or your alternative choice of parameter values) for LSH) for one or more of
the image patches, together with the image patch itself. How do they compare visually?
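The error measure in part (d) can be prototyped independently of the provided code. The sketch below is written in Python and does not use the provided lsh or lshlookup functions; the names `l1_distance`, `linear_search` and `error_measure` are hypothetical, points are plain lists of numbers, and the linear search does not exclude the query itself from its own neighbor list, which is an assumption the problem leaves open.

```python
def l1_distance(u, v):
    """L1 (Manhattan) distance between two equal-length vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))

def linear_search(data, query, top=3):
    """Exact near neighbors by scanning every point; returns indices, closest first."""
    order = sorted(range(len(data)), key=lambda i: l1_distance(data[i], query))
    return order[:top]

def error_measure(queries, approx_neighbors, true_neighbors):
    """Average over queries of (sum of approximate distances) / (sum of true distances)."""
    total = 0.0
    for z, approx, exact in zip(queries, approx_neighbors, true_neighbors):
        numerator = sum(l1_distance(x, z) for x in approx)
        denominator = sum(l1_distance(x, z) for x in exact)
        total += numerator / denominator
    return total / len(queries)

data = [[0, 0], [1, 0], [0, 2], [5, 5]]
query = [0, 0]
exact = [data[i] for i in linear_search(data, query)]
approx = [data[1], data[2], data[3]]   # stand-in for an LSH result (toy values)
print(error_measure([query], [approx], [exact]))
```

An error of exactly 1 means the approximate neighbors are as close as the true ones; larger values indicate a worse approximation.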
[1] Dataset and code adopted from Brown University's Greg Shakhnarovich
