
University of Ottawa
Faculty of Engineering
School of Information Technology and Engineering

CSI 4107
Information Retrieval and the Internet
FINAL EXAMINATION

Length of Examination: 3 hours
April 28, 2006, 14:00


Professor: Diana Inkpen

Family Name: _________________________________________________

Other Names: _________________________________________________

Student Number: ___________

Signature ____________________________

Important Regulations:
1. Calculators are allowed.
2. A student identification card (or another photo ID and signature) is required.
3. An attendance sheet shall be circulated and should be signed by each student.
4. Please answer all questions on this paper, in the indicated spaces.

Marks:
A      B      C      D      Total
10     10     8      12     40
Part A [10 marks]
Short answers and explanations.

1. (2 marks) Explain how you could use text categorization and text clustering together to implement a
text classifier. Assume you have a few labeled documents (manually annotated with the class label)
and many documents that are not labeled. Assume there are two classes, C1 and C2.

Use the labeled documents to produce good initial centroids for the two clusters, by averaging
the labeled document vectors of each class.
Then assign the unlabeled documents to these two clusters.
Finally, train a classifier using the documents from the two clusters as training data; the
cluster assignments provide the labels (C1 or C2).
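A minimal sketch of this scheme in Python, under the stated assumptions (tf vectors and cosine similarity; all function names here are illustrative, not from the exam):

```python
import numpy as np

def seed_centroids(labeled_vectors, labels):
    # Average the labeled document vectors of each class (C1, C2).
    classes = sorted(set(labels))
    return {c: np.mean([v for v, l in zip(labeled_vectors, labels) if l == c], axis=0)
            for c in classes}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

def label_by_clustering(unlabeled_vectors, centroids):
    # Assign each unlabeled document to the cluster with the closest centroid.
    return [max(centroids, key=lambda c: cosine(v, centroids[c]))
            for v in unlabeled_vectors]

# The labeled documents plus the (vector, inferred label) pairs then form
# the training data for any supervised text classifier.
```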

2. (1 mark) What is one way in which Zipf’s Law manifests itself on the Web?

The number of in-links to web pages follows a Zipfian distribution (a few pages have very many
in-links; most pages have very few).

The number of hits (visits) to web pages follows a Zipfian distribution as well.

3. (2 marks) Compute the longest common subsequence between the following strings. Normalize by
the length of the longest string. (A subsequence of a string is obtained by deleting zero or more
characters.)

“alabaster”
“albastru”

The longest common subsequence is “albastr”, of length 7.

Normalized by the length of the longest string (“alabaster”, 9 characters), the similarity is 7/9 ≈ 0.78.
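A minimal sketch of the standard dynamic-programming computation (not required by the question):

```python
def lcs_length(a: str, b: str) -> int:
    # dp[i][j] = length of the LCS of a[:i] and b[:j]
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a, 1):
        for j, cb in enumerate(b, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if ca == cb else max(dp[i-1][j], dp[i][j-1])
    return dp[len(a)][len(b)]

def normalized_lcs(a: str, b: str) -> float:
    # Normalize by the length of the longer string, as the question asks.
    return lcs_length(a, b) / max(len(a), len(b))

print(normalized_lcs("alabaster", "albastru"))  # 7/9 = 0.777...
```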

CSI 4107 – April 2006 Final Examination Page 3 of 10


4. (2 marks) Write a regular expression to describe a URL, in generic terms. Cover URLs of the
following types:
http://www.ibm.com
http://www.uottawa.ca/welcome.html
http://www.site.uottawa.ca/~diana/csi4107/
http://www.site.uottawa.ca/~diana/eacl2006-clki-workshop.html#committee

http://(\w+\.)+\w\w\w?(/~?[\w-]+)*(\.html?(#\w+)?|/)?

+ means 1 or more repetitions
* zero or more
? 0 or 1 occurrence
\w an alphanumeric character (or underscore)
\. a literal dot
( ) for grouping, | for alternation
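As a sanity check, a sketch of the same pattern in Python’s re syntax (assuming \w matches [A-Za-z0-9_]):

```python
import re

URL_RE = re.compile(r"http://(\w+\.)+\w\w\w?(/~?[\w-]+)*(\.html?(#\w+)?|/)?")

urls = [
    "http://www.ibm.com",
    "http://www.uottawa.ca/welcome.html",
    "http://www.site.uottawa.ca/~diana/csi4107/",
    "http://www.site.uottawa.ca/~diana/eacl2006-clki-workshop.html#committee",
]
for url in urls:
    assert URL_RE.fullmatch(url), url  # each sample URL should match fully
```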

5. (2 marks) Explain how you compute probabilities of relevance and non-relevance of a document
to a query in the probabilistic retrieval model. Explain the idea, not the formulas. How do you move
from the probability of a document to counts of terms? How do you deal with zero counts?

- assume R is the relevant set; start with a guess and improve it iteratively
- use Bayes’ rule to move from P(R|d) to P(d|R)
- assume terms are independent, to express P(d|R) in terms of the probability of occurrence
of each term (and of non-occurrence)
- smooth zero counts by adding a small constant
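For reference, a minimal sketch of the formulas behind these steps (binary independence model; x_i = 1 if term i occurs in document d; the 0.5 smoothing constant is one common choice, not the only one):

```latex
% Bayes' rule: rank by the odds of relevance; P(d) cancels in the ratio.
\frac{P(R \mid d)}{P(\bar{R} \mid d)} \;\propto\; \frac{P(d \mid R)}{P(d \mid \bar{R})}
% Term independence factors the document into per-term probabilities:
\frac{P(d \mid R)}{P(d \mid \bar{R})} \;=\; \prod_i \frac{P(x_i \mid R)}{P(x_i \mid \bar{R})}
% Probabilities become counts over the guessed relevant set R
% (r_i = number of relevant documents containing term i), smoothed:
p_i \;=\; \frac{r_i + 0.5}{\lvert R \rvert + 1}
```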

6. (1 mark) What is the main advantage of Latent Semantic Indexing?

The main advantage is that it groups terms into concepts, which allows retrieval of documents
that contain none of the query keywords but do contain synonyms of them.
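A minimal sketch of the underlying computation (truncated SVD of the term-document matrix; the matrix here is a hypothetical example, not from the exam):

```python
import numpy as np

# A is a small hypothetical term-document matrix (rows = terms, cols = docs).
A = np.array([[1., 0., 1.],
              [0., 1., 1.],
              [1., 1., 0.]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2                                  # keep the k strongest concept dimensions
docs_k = (np.diag(s[:k]) @ Vt[:k]).T   # one row per document, in concept space

def fold_in(q):
    # Project a (term-space) query vector into the same concept space.
    return q @ U[:, :k] / s[:k]

# Retrieval compares fold_in(query) to the rows of docs_k with cosine
# similarity, so a document can match even with no keyword overlap.
```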



Part B [10 marks]

Evaluation of IR systems

The following is the uninterpolated Precision-Recall Graph of an IR system, for one topic (the blue
line). You know that 20 hits were retrieved, and that there are 16 relevant documents for this topic (not
all of which are retrieved).

A. (2 points) What does the interpolated graph look like? Draw neatly on the graph above.

- see the pink line on the figure

B. (4 points) In the diagram below, each box represents a hit. Based on the above Precision-Recall
graph, which hits are relevant? Write an “R” on the relevant hits. Leave the non-relevant hits alone.

The relevant hits (read off the precision values used in part C) are at ranks 1, 3, 4, 8, 9, 10, 13, 14, and 18.



C. (2 points) What is the Uninterpolated Average Precision?
(defined as sum of the precision values for the points where relevant documents are found divided by
how many relevant documents exist)

(1 + 2/3 + 3/4 + 4/8 + 5/9 + 6/10 + 7/13 + 8/14 + 9/18) / 16 ≈ 0.36

D. (1 point) What is the R-precision?

8/16 = 0.5

E. (1 point) What is Precision at 10?

6/10 = 0.6
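A minimal sketch that recomputes parts C, D, and E from the relevant ranks identified in part B:

```python
# Relevant ranks read off the Precision-Recall graph (1-based positions).
relevant_ranks = [1, 3, 4, 8, 9, 10, 13, 14, 18]
total_relevant = 16      # relevant documents that exist for this topic

# Uninterpolated average precision: precision at each relevant rank,
# summed and divided by the number of relevant documents that exist.
avg_prec = sum((i + 1) / r for i, r in enumerate(relevant_ranks)) / total_relevant

# R-precision: precision after retrieving R = 16 documents.
r_precision = sum(r <= total_relevant for r in relevant_ranks) / total_relevant

# Precision at 10: fraction of the first 10 hits that are relevant.
prec_at_10 = sum(r <= 10 for r in relevant_ranks) / 10

print(round(avg_prec, 2), r_precision, prec_at_10)   # 0.36 0.5 0.6
```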



Part C [8 marks]
Consider the following web pages and the links between them:
Page A points to pages B and C.
Page B points to page E.
Page C points to pages B and D.
Page D points to page E.
Page E points to pages A and C.

Show the order in which the pages are indexed when starting at page A and using a spider (with
duplicate page detection). Assume links on a page are examined in the order given above.

1. What is the traversal order for a breadth-first spider?

Order: A, B, C, E, D

2. What is the traversal order for a depth-first spider?

Order: A, B, E, C, D
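A minimal sketch of both spiders on this graph (duplicate detection via a visited set; out-links examined in the order given):

```python
from collections import deque

links = {"A": ["B", "C"], "B": ["E"], "C": ["B", "D"],
         "D": ["E"], "E": ["A", "C"]}

def bfs(start):
    order, visited, frontier = [], {start}, deque([start])
    while frontier:
        page = frontier.popleft()
        order.append(page)
        for nxt in links[page]:
            if nxt not in visited:   # duplicate page detection
                visited.add(nxt)
                frontier.append(nxt)
    return order

def dfs(start, visited=None):
    visited = visited if visited is not None else []
    visited.append(start)            # the visit order doubles as the visited set
    for nxt in links[start]:
        if nxt not in visited:
            dfs(nxt, visited)
    return visited

print(bfs("A"))  # ['A', 'B', 'C', 'E', 'D']
print(dfs("A"))  # ['A', 'B', 'E', 'C', 'D']
```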

3. Do you recommend breadth-first or depth-first strategy? Why?

Recommended: breadth-first.
Why? - it covers the top-level pages of a website first
- it does not get lost following long chains of links

4. What other traversal strategy might be better?

Better traversal strategy: order the crawl frontier by PageRank or by topic relevance.



Part D [12 marks]

1. Assume the following documents are in the training set, classified into two classes:

Dog: “cute puppy”
Dog: “big puppy”
Cat: “cute white kitten”
Cat: “white kitten”

a. (4 marks) Apply the Rocchio algorithm to classify a new document: “white puppy”
Use tf without idf and normalization, with cosine similarity. Assume the words in the vectors are
ordered alphabetically. Show the prototype vectors for the two classes, and their similarities to the new
document.

     big   cute  kitten  puppy  white
d1    0     1     0       1      0
d2    1     0     0       1      0
d3    0     1     1       0      1
d4    0     0     1       0      1
q     0     0     0       1      1

Dog: p1 = d1 + d2 = <1,1,0,2,0>, length sqrt(6)
Cat: p2 = d3 + d4 = <0,1,2,0,2>, length 3

cosSim(q, p1) = 2 / (sqrt(2) sqrt(6)) ≈ 0.58
cosSim(q, p2) = 2 / (3 sqrt(2)) ≈ 0.47
=> The new document is classified as Dog (more similar to p1).
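A minimal sketch of this Rocchio classification:

```python
import numpy as np

# Vocabulary ordered alphabetically: big, cute, kitten, puppy, white.
docs = {"Dog": [np.array([0, 1, 0, 1, 0]),    # d1 "cute puppy"
                np.array([1, 0, 0, 1, 0])],   # d2 "big puppy"
        "Cat": [np.array([0, 1, 1, 0, 1]),    # d3 "cute white kitten"
                np.array([0, 0, 1, 0, 1])]}   # d4 "white kitten"
q = np.array([0, 0, 0, 1, 1])                 # "white puppy"

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Rocchio prototype = sum of the tf vectors of each class (no normalization).
prototypes = {c: np.sum(vs, axis=0) for c, vs in docs.items()}
print(max(prototypes, key=lambda c: cosine(q, prototypes[c])))  # Dog
```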

b. (3 marks) Apply kNN with k=3 to classify the new document: “white puppy”
Use tf without idf and normalization, with cosine similarity.
Would the result be the same if k=1? Why?

cosSim(q, d1) = 1 / (sqrt(2) sqrt(2)) = 1/2
cosSim(q, d2) = 1 / (sqrt(2) sqrt(2)) = 1/2
cosSim(q, d3) = 1 / (sqrt(2) sqrt(3)) ≈ 0.41
cosSim(q, d4) = 1 / (sqrt(2) sqrt(2)) = 1/2

With k=3 the closest documents are d1, d2, d4 => the new document is classified as Dog.
With k=1 the closest document is d1 => the new document is classified as Dog
(assuming ties are solved by taking the first neighbour, i.e. the one with the lowest index dj).
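A minimal sketch of the kNN classification, over the same tf vectors (ties broken by keeping the lowest document index first, as assumed above):

```python
import numpy as np
from collections import Counter

# Vocabulary ordered alphabetically: big, cute, kitten, puppy, white.
train = [(np.array([0, 1, 0, 1, 0]), "Dog"),   # d1 "cute puppy"
         (np.array([1, 0, 0, 1, 0]), "Dog"),   # d2 "big puppy"
         (np.array([0, 1, 1, 0, 1]), "Cat"),   # d3 "cute white kitten"
         (np.array([0, 0, 1, 0, 1]), "Cat")]   # d4 "white kitten"
q = np.array([0, 0, 0, 1, 1])                  # "white puppy"

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def knn(query, k):
    # sorted() is stable, so equal similarities keep their original order,
    # which breaks ties in favour of the lowest document index.
    ranked = sorted(train, key=lambda dv: -cosine(query, dv[0]))
    return Counter(label for _, label in ranked[:k]).most_common(1)[0][0]

print(knn(q, 3), knn(q, 1))  # Dog Dog
```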



2. (5 marks) Cluster the following documents using K-means with K=2 and cosine similarity.

Doc1: “fly eagle”
Doc2: “go eagle fly”
Doc3: “eagle fly eagle”
Doc4: “go fly”
Doc5: “go eagle”

Assume Doc1 and Doc4 are chosen as initial seeds. Use tf (without idf and normalization) and cosine
similarity. Assume the words in the vectors are ordered alphabetically.
Show the clusters and their centroids for each iteration. How many iterations are needed for the
algorithm to converge?

     eagle  fly   go    length
d1     1     1     0    sqrt(2)
d2     1     1     1    sqrt(3)
d3     2     1     0    sqrt(5)
d4     0     1     1    sqrt(2)
d5     1     0     1    sqrt(2)
---------------------------------------
Iteration 1
c1 = d1 = <1,1,0>   length sqrt(2)
c2 = d4 = <0,1,1>   length sqrt(2)

cosSim(d1,c1) = 1 > cosSim(d1,c2) = 1/(sqrt(2) sqrt(2)) = 1/2
cosSim(d2,c1) = 2/(sqrt(2) sqrt(3)) = cosSim(d2,c2) = 2/(sqrt(2) sqrt(3))
cosSim(d3,c1) = 3/(sqrt(2) sqrt(5)) > cosSim(d3,c2) = 1/(sqrt(2) sqrt(5))
cosSim(d4,c1) = 1/(sqrt(2) sqrt(2)) = 1/2 < cosSim(d4,c2) = 1
cosSim(d5,c1) = 1/2 = cosSim(d5,c2) = 1/2

(solve ties by choosing c1)

=> d1, d2, d3, d5 in c1; d4 in c2


---------------------------------------
Iteration 2
c1’ = (d1+d2+d3+d5)/4 = <5/4,3/4,2/4>   length sqrt(38)/4
c2’ = d4 = <0,1,1>   length sqrt(2)

cosSim(d1,c1’) = (8/4)/(sqrt(38) sqrt(2)/4) = 8/sqrt(76) ≈ 0.92 > cosSim(d1,c2’) = 1/2
cosSim(d2,c1’) = 10/sqrt(114) ≈ 0.94 > cosSim(d2,c2’) = 2/sqrt(6) ≈ 0.82
cosSim(d3,c1’) = 13/sqrt(190) ≈ 0.94 > cosSim(d3,c2’) = 1/sqrt(10) ≈ 0.32
cosSim(d4,c1’) = 5/sqrt(76) ≈ 0.57 < cosSim(d4,c2’) = 1
cosSim(d5,c1’) = 7/sqrt(76) ≈ 0.80 > cosSim(d5,c2’) = 1/2

=> d1, d2, d3, d5 in c1’; d4 in c2’

The assignments did not change, so the algorithm converged in 2 iterations.
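A minimal sketch that reproduces this run (ties are broken toward the first centroid, as assumed above):

```python
import numpy as np

# Vocabulary ordered alphabetically: eagle, fly, go.
docs = np.array([[1., 1., 0.],   # d1 "fly eagle"
                 [1., 1., 1.],   # d2 "go eagle fly"
                 [2., 1., 0.],   # d3 "eagle fly eagle"
                 [0., 1., 1.],   # d4 "go fly"
                 [1., 0., 1.]])  # d5 "go eagle"

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

centroids = [docs[0], docs[3]]   # initial seeds d1 and d4
assignment = None
for iteration in range(1, 20):
    # max() keeps the first index on ties, so ties are solved toward c1.
    new = [max(range(2), key=lambda c: cosine(d, centroids[c])) for d in docs]
    if new == assignment:
        break                    # assignments stable => converged
    assignment = new
    centroids = [docs[[i for i, a in enumerate(assignment) if a == c]].mean(axis=0)
                 for c in range(2)]

print(iteration, assignment)     # 2 [0, 0, 0, 1, 0]: d4 alone in the second cluster
```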
