CSI 4107
Information Retrieval and the Internet
FINAL EXAMINATION
Signature ____________________________
Important Regulations:
1. Calculators are allowed.
2. A student identification card (or another photo ID with signature) is required.
3. An attendance sheet shall be circulated and should be signed by each student.
4. Please answer all questions on this paper, in the indicated spaces.
Marks:
A B C D Total
10 10 8 12 40
CSI 4107 – April 2006 Final Examination Page 2 of 10
Part A [10 marks]
Short answers and explanations.
1. (2 marks) Explain how you could use text categorization and text clustering together to implement a
text classifier? Assume you have a few labeled documents (manually annotated with the class label),
and many documents that are not labeled. Assume there are two classes, C1 and C2.
Use the labeled documents to produce very good centroids for the two clusters, by averaging
the labeled documents for each of the two classes.
Then cluster the unlabeled documents in the two clusters.
Train a classifier using as training data the documents from the two clusters; assume the
clusters give them labels (C1 or C2).
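The steps above can be sketched in Python. The documents below are hypothetical toy data (the question gives none), and the sketch stops at the point where the cluster assignments become training labels for the classifier:

```python
import math

def centroid(vectors):
    """Average a list of term-frequency dicts into one centroid vector."""
    total = {}
    for v in vectors:
        for term, tf in v.items():
            total[term] = total.get(term, 0.0) + tf
    return {t: s / len(vectors) for t, s in total.items()}

def cos_sim(a, b):
    dot = sum(a.get(t, 0.0) * b.get(t, 0.0) for t in a)
    na = math.sqrt(sum(x * x for x in a.values()))
    nb = math.sqrt(sum(x * x for x in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Step 1: hypothetical labeled documents -> one centroid per class
labeled = {
    "C1": [{"ball": 1, "goal": 1}, {"goal": 2, "team": 1}],
    "C2": [{"stock": 1, "price": 1}, {"price": 2, "market": 1}],
}
centroids = {c: centroid(docs) for c, docs in labeled.items()}

# Step 2: cluster the unlabeled documents around the two centroids;
# the resulting assignments serve as (noisy) training labels for step 3
unlabeled = [{"goal": 1, "team": 2}, {"market": 1, "stock": 2}]
labels = [max(centroids, key=lambda c: cos_sim(d, centroids[c]))
          for d in unlabeled]
```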
2. (1 mark) What is one way in which Zipf’s Law manifests itself on the Web?
3. (2 marks) Compute the longest common subsequence between the following strings. Normalize by
the length of the longest string. (A subsequence of a string is obtained by deleting zero or more
characters.)
“alabaster”
“albastru”
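A standard dynamic-programming sketch confirms the result: the longest common subsequence is "albastr" (length 7), and the longer string has length 9, so the normalized similarity is 7/9 ≈ 0.78:

```python
def lcs_length(s, t):
    """Classic O(|s||t|) dynamic-programming longest common subsequence."""
    dp = [[0] * (len(t) + 1) for _ in range(len(s) + 1)]
    for i, a in enumerate(s, 1):
        for j, b in enumerate(t, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if a == b else max(dp[i-1][j], dp[i][j-1])
    return dp[len(s)][len(t)]

length = lcs_length("alabaster", "albastru")                  # "albastr" -> 7
similarity = length / max(len("alabaster"), len("albastru"))  # 7/9
```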
http://www\.(\w+\.)*\w\w\w?(/\w+)*(/\w+\.html?(#\w+)?)?
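The pattern can be sanity-checked in Python. The original answer's `http:://` and unescaped dots look like typos, so the sketch below uses `http://` and `\.`; the test URLs are made up for illustration:

```python
import re

# The model-answer URL pattern, with "http://" and the literal dots escaped
url_re = re.compile(r"http://www\.(\w+\.)*\w\w\w?(/\w+)*(/\w+\.html?(#\w+)?)?")

ok1 = url_re.fullmatch("http://www.uottawa.ca")
ok2 = url_re.fullmatch("http://www.site.uottawa.ca/courses/page.html#top")
bad = url_re.fullmatch("ftp://www.uottawa.ca")
```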
5. (2 marks) Explain how you compute probabilities of relevance and non-relevance of a document
to a query in the probabilistic retrieval model. Explain the idea, not the formulas. How do you move
from the probability of a document to counts of terms? How do you deal with zero counts?
- assume R is the relevant set, start with guess and improve iteratively
- use Bayes rule to move from P(R|d) to P(d|R)
- assume terms are independent, to express P(d|R) in terms of probability of occurrence
of each term (and non-occurrence)
- smooth zero counts by adding a small constant
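The steps above can be sketched numerically. The helper names are hypothetical; the smoothing shown is the classic add-0.5 estimate of the binary independence model, and the score is the log-odds of relevance under term independence:

```python
import math

def term_prob(n_with_term, n_docs):
    """Estimate P(term occurs | doc set) from counts, with add-0.5
    smoothing so that zero counts never yield probability 0 or 1."""
    return (n_with_term + 0.5) / (n_docs + 1.0)

def rsv(doc_terms, vocab, p_rel, p_nonrel):
    """log P(d|R)/P(d|NR) under the term-independence assumption:
    every vocabulary term contributes, whether present in d or not."""
    s = 0.0
    for t in vocab:
        pr, pn = p_rel[t], p_nonrel[t]
        if t in doc_terms:
            s += math.log(pr / pn)          # term occurs in d
        else:
            s += math.log((1 - pr) / (1 - pn))  # term absent from d
    return s

# A term in 0 of 10 relevant docs still gets a small nonzero probability
p = term_prob(0, 10)   # 0.5 / 11
```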
The main advantage is that it groups terms into concepts, and therefore allows retrieval of documents
that do not contain any query keyword if they contain synonyms of the keywords.
Evaluation of IR systems
The following is the uninterpolated Precision-Recall Graph of an IR system, for one topic (the blue
line). You know that 20 hits were retrieved, and that there are 16 relevant documents for this topic (not
all of which are retrieved).
A. (2 points) What does the interpolated graph look like? Draw neatly on the graph above.
B. (4 points) In the diagram below, each box represents a hit. Based on the above Precision-Recall
graph, which hits are relevant? Write an “R” on the relevant hits. Leave the non-relevant hits alone.
8/16 = 0.5
6/10 = 0.6
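The interpolation rule behind part A can be sketched as code: interpolated precision at recall r is the maximum precision at recall r or higher, which makes the interpolated graph monotonically non-increasing. The points below are hypothetical, since the exam's actual graph is not reproduced here:

```python
def interpolate(points):
    """Interpolated precision: at each recall level, take the maximum
    precision at that recall or any higher recall level."""
    out = []
    best = 0.0
    for recall, precision in sorted(points, reverse=True):  # high recall first
        best = max(best, precision)
        out.append((recall, best))
    return list(reversed(out))

# Hypothetical uninterpolated (recall, precision) points
raw = [(0.1, 1.0), (0.2, 0.5), (0.3, 0.75), (0.4, 0.6)]
interp = interpolate(raw)   # the dip at recall 0.2 is lifted to 0.75
```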
Show the order in which the pages are indexed when starting at page A and using a spider (with
duplicate page detection). Assume links on a page are examined in the order given above.
Depth-first order: A, B, C, E, D
Breadth-first order: A, B, E, C, D
Recommended: breadth-first.
Why? - covers the top-level part of a website first
- doesn't get lost in deep paths
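A breadth-first spider with duplicate-page detection can be sketched as below. The link graph is hypothetical (the exam's actual page diagram is not reproduced here), chosen so that it yields the breadth-first order A, B, E, C, D:

```python
from collections import deque

def bfs_crawl(start, links):
    """Breadth-first spider with duplicate-page detection:
    a page is queued only the first time it is seen."""
    seen = {start}
    order = []
    queue = deque([start])
    while queue:
        page = queue.popleft()
        order.append(page)                  # index this page
        for nxt in links.get(page, []):     # links in the order given
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return order

# Hypothetical link graph: A links to B and E, B to C and A, E to D and B
links = {"A": ["B", "E"], "B": ["C", "A"], "E": ["D", "B"]}
order = bfs_crawl("A", links)
```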
1. Assume the following documents are in the training set, classified into two classes:
a. (4 marks) Apply the Rocchio algorithm to classify a new document: “white puppy”
Use tf without idf and normalization, with cosine similarity. Assume the words in the vectors are
ordered alphabetically. Show the prototype vectors for the two classes, and their similarities to the new
document.
cosSim(q, p1) = 2 / (sqrt(2) * sqrt(6)) ≈ 0.577
cosSim(q, p2) = 2 / (3 * sqrt(2)) ≈ 0.471
=> New document is classified as Dog (more similar to p1)
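The arithmetic of the two similarities above can be checked directly:

```python
import math

# Reproducing the two cosine similarities from the answer above
sim_p1 = 2 / (math.sqrt(2) * math.sqrt(6))   # prototype of the Dog class
sim_p2 = 2 / (3 * math.sqrt(2))              # prototype of the other class
winner = "p1 (Dog)" if sim_p1 > sim_p2 else "p2"
```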
b. (3 marks) Apply kNN with k=3 to classify the new document: “white puppy”
Use tf without idf and normalization, with cosine similarity.
Would the result be the same if k=1? Why?
With K=3 the closest documents are d1, d2, d4 => New document is classified as Dog
With K=1 the closest document is d1 => New document is classified as Dog
(assuming we solve ties by taking the first neighbour - with the lowest index dj)
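The kNN procedure, with the tie-breaking convention assumed above, can be sketched as follows. The training vectors are hypothetical, since the exam's document table is not reproduced here:

```python
import math
from collections import Counter

def cos_sim(a, b):
    dot = sum(a.get(t, 0.0) * b.get(t, 0.0) for t in a)
    na = math.sqrt(sum(x * x for x in a.values()))
    nb = math.sqrt(sum(x * x for x in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def knn_classify(query, training, k):
    """kNN with cosine similarity. Vote ties are broken in favour of the
    neighbour with the lowest index (sorted() is stable, so equal
    similarities keep their original document order)."""
    ranked = sorted(range(len(training)),
                    key=lambda i: -cos_sim(query, training[i][0]))
    votes = Counter(training[i][1] for i in ranked[:k])
    best = max(votes.values())
    tied = {c for c, n in votes.items() if n == best}
    for i in ranked[:k]:              # earliest neighbour among tied classes
        if training[i][1] in tied:
            return training[i][1]

# Hypothetical training documents (tf vectors, alphabetical terms)
training = [
    ({"puppy": 1, "white": 1}, "Dog"),   # d1
    ({"dog": 1, "puppy": 1}, "Dog"),     # d2
    ({"cat": 1, "meow": 1}, "Cat"),      # d3
    ({"dog": 1, "white": 1}, "Dog"),     # d4
]
query = {"puppy": 1, "white": 1}
```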
Assume Doc1 and Doc4 are chosen as initial seeds. Use tf (without idf and normalization) and cosine
similarity. Assume the words in the vectors are ordered alphabetically.
Show the clusters and their centroids for each iteration. How many iterations are needed for the
algorithm to converge?
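The iteration asked for above can be sketched as 2-means with cosine similarity; the documents are hypothetical stand-ins (the exam's table is not reproduced here), with Doc1 and Doc4 as the seeds:

```python
import math

def cos_sim(a, b):
    dot = sum(a.get(t, 0.0) * b.get(t, 0.0) for t in a)
    na = math.sqrt(sum(x * x for x in a.values()))
    nb = math.sqrt(sum(x * x for x in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def centroid(vectors):
    total = {}
    for v in vectors:
        for t, x in v.items():
            total[t] = total.get(t, 0.0) + x
    return {t: s / len(vectors) for t, s in total.items()}

def kmeans(docs, seed_indices, max_iter=20):
    """k-means with cosine similarity: assign every document to the most
    similar centroid, recompute centroids as cluster averages, and stop
    when the assignment no longer changes."""
    centroids = [dict(docs[i]) for i in seed_indices]
    assignment = None
    for iteration in range(1, max_iter + 1):
        new = [max(range(len(centroids)),
                   key=lambda c: cos_sim(d, centroids[c]))
               for d in docs]
        if new == assignment:                 # converged: no reassignment
            return assignment, centroids, iteration
        assignment = new
        for c in range(len(centroids)):
            members = [docs[i] for i, a in enumerate(assignment) if a == c]
            if members:
                centroids[c] = centroid(members)
    return assignment, centroids, max_iter

# Hypothetical tf vectors; seeds are Doc1 and Doc4 (indices 0 and 3)
docs = [{"dog": 2}, {"dog": 1, "puppy": 1}, {"cat": 2}, {"cat": 1, "kitten": 1}]
assignment, centroids, iterations = kmeans(docs, [0, 3])
```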