BIRLA INSTITUTE OF TECHNOLOGY & SCIENCE, PILANI

HYDERABAD CAMPUS
FIRST SEMESTER 2015-2016
INFORMATION RETRIEVAL (CS F469)
MID SEMESTER EXAM
Date: 06.10.2015 Weightage: 30% (60 M)
Duration: 90 min. Type: Closed Book
Note: Answer all parts of the question together.
Answers must be brief.
Q1. [1+1.5+2.5 = 5 Marks]
a) Why is the grep command not preferred for information retrieval?
b) Using the Porter algorithm, the following pairs are mapped to the same root:
abandon - abandoned
university - universe
volume - volumes
Explain in one line which of these cases make sense and which might be problematic.
c) Construct an inverted index for the following document collection.
D1 new home sales top forecasts
D2 home sales rise in july
D3 increase in home sales in july
D4 july new home sales rise
d) Given the following positional index [2+2 = 4 Marks]
ANGELS: 2: <36,174,252,651>; 4: <12,22,102,432>; 7: <17>;
FOOLS: 2: <1,17,74,222>; 4: <8,78,108,458>; 7: <3,13,23,193>;
FEAR: 2: <87,704,722,901>; 4: <13,43,113,433>; 7: <18,328,528>;
IN: 2: <3,37,76,444,851>; 4: <10,20,110,470,500>; 7: <5,15,25,195>;
RUSH: 2: <2,66,194,321,702>; 4: <9,69,149,429,569>; 7: <4,14,404>;
TO: 2: <47,86,234,999>; 4: <14,24,774,944>; 7: <199,319,599,709>;
TREAD: 2: <57,94,333>; 4: <15,35,155>; 7: <20,320>;
WHERE: 2: <67,124,393,1001>; 4: <11,41,101,421,431>; 7: <15,35,735>;
i. Which documents match the phrase query FOOLS RUSH IN AND ANGELS FEAR
TO TREAD?
ii. There is something wrong with this positional index. What is the problem?

e) Give the name of the index we need to use in each of the following cases: [2 Marks]
i. We want to consider word order in the queries and the documents for an arbitrary number of
words.
ii. We assume that word order is only important for two consecutive terms.

f) Consider a two-word query. The postings list of one term consists of the following 16 entries:
[4,6,10,12,14,16,18,20,22,32,47,81,120,122,157,180] and the postings list of the other term
has the single entry [47].
How many comparisons would be done to intersect the two postings lists with the following
two strategies? Briefly justify your answers. [1+2=3 Marks]
i. Using standard postings lists
ii. Using postings lists stored with skip pointers, with a skip length as discussed in the class.
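Study note for part f): the count depends on how a "comparison" is defined and on the skip length used in class, neither of which is stated here. The sketch below is one possible reading, not the official answer. It assumes a skip length of floor(sqrt(P)) for a list of P postings, counts one comparison each time the two current docIDs are compared, and counts one more each time a skip target is checked. The function names are hypothetical.

```python
import math

def skip_target(postings, i):
    """Index that the skip pointer at position i leads to, or None if absent."""
    step = math.isqrt(len(postings))  # assumed skip length: floor(sqrt(P))
    if step > 1 and i % step == 0 and i + step < len(postings):
        return i + step
    return None

def advance(postings, pos, other_doc_id, use_skips, comparisons):
    """Move one list forward, taking a skip when it does not overshoot."""
    target = skip_target(postings, pos) if use_skips else None
    if target is not None:
        comparisons += 1  # checking the skip target counts as a comparison
        if postings[target] <= other_doc_id:
            return target, comparisons
    return pos + 1, comparisons

def intersect(p1, p2, use_skips=False):
    """Merge two sorted postings lists; return (matches, comparisons)."""
    i = j = comparisons = 0
    matches = []
    while i < len(p1) and j < len(p2):
        comparisons += 1  # comparing the two current docIDs
        if p1[i] == p2[j]:
            matches.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i, comparisons = advance(p1, i, p2[j], use_skips, comparisons)
        else:
            j, comparisons = advance(p2, j, p1[i], use_skips, comparisons)
    return matches, comparisons

long_list = [4, 6, 10, 12, 14, 16, 18, 20, 22, 32, 47, 81, 120, 122, 157, 180]
short_list = [47]
plain = intersect(long_list, short_list)
skipped = intersect(long_list, short_list, use_skips=True)
print(plain, skipped)
```

Under these assumptions the merge stops as soon as the one-entry list is exhausted, so entries after 47 are never examined. A different counting convention or skip length would change the totals.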

Q2. Tolerant retrieval [4*3 = 12 Marks]


a) Consider the following sentence: silly billy sally sat on the hill
i. What is the total number of bigram dictionary entries that will be generated by the above
text? List the first five entries in the dictionary.
ii. How would the wild-card query si*y be best expressed as an AND query using the bigram
index you have constructed? Think about the most efficient query in terms of the number of
posting entries traversed. For simplicity, answer by giving a list of bigrams ordered based
on their postings list sizes (one line, space separated). For example, you should write ab cd
ef to refer to the Boolean query ((ab AND cd) AND ef ).
iii. How many posting entries are traversed for the most efficient query in question ii?
b) What are the Soundex codes for the two names Robert and Rupert, if the letters are
mapped to digits as follows: (B, F, P, V → 1) (C, G, J, K, Q, S, X, Z → 2) (D, T → 3)
(L → 4) (M, N → 5) (R → 6)? [2 Marks]
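Study note for part b): a minimal sketch of Soundex using the letter-to-digit mapping given in the question. It assumes the common textbook procedure, which the question itself does not spell out: keep the first letter, code the remaining letters (vowels and H, W, Y as 0), collapse adjacent identical digits, drop the zeros, and pad or truncate to one letter plus three digits. The names `CODES` and `soundex` are hypothetical.

```python
# Letter groups from the question, plus the assumed "0" group for A, E, I, O, U, H, W, Y.
GROUPS = [
    ("AEIOUHWY", "0"),
    ("BFPV", "1"),
    ("CGJKQSXZ", "2"),
    ("DT", "3"),
    ("L", "4"),
    ("MN", "5"),
    ("R", "6"),
]
CODES = {letter: digit for letters, digit in GROUPS for letter in letters}

def soundex(name):
    """Return a four-character Soundex code for a non-empty alphabetic name."""
    name = name.upper()
    digits = [CODES[ch] for ch in name[1:] if ch in CODES]
    collapsed = []
    for d in digits:
        # Keep only one of each run of identical adjacent digits.
        if not collapsed or collapsed[-1] != d:
            collapsed.append(d)
    kept = [d for d in collapsed if d != "0"]
    return (name[0] + "".join(kept) + "000")[:4]

robert_code = soundex("Robert")
rupert_code = soundex("Rupert")
print(robert_code, rupert_code)
```

Some Soundex variants also compare the first letter's own code with the digit that follows it. That refinement is left out here and does not affect these two names.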

Q3. Ranked Retrieval [2+3+3= 8 Marks]


Consider a sample corpus consisting of three documents given below
D1: Hindustan Hamara Hai
D2: Hindustan Hamara jaiho
D3: ye gulistaan hai
Q: Hindustan Hindustan Hai
a) What are the similarity scores between the query (Q) and each document given above using
the Jaccard coefficient?
b) Compute the tf-idf score for each term in the document.
c) Compute the cosine score between the query (Q) and each document using the tf-idf of
the terms computed in question b and rank the documents. To compute the tf-idf of the terms in
the query, consider the number of documents to be 1 and use the tf of each term.
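Study note for part a): a small sketch of the Jaccard coefficient on the corpus above. It assumes that each text is treated as a set of lowercased, whitespace-separated terms, so repeated query terms count once. Parts b) and c) are not sketched because the result depends on the tf-idf weighting variant used in class. The function name `jaccard` is hypothetical.

```python
from fractions import Fraction

def jaccard(text_a, text_b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| over lowercased term sets."""
    a = set(text_a.lower().split())
    b = set(text_b.lower().split())
    return Fraction(len(a & b), len(a | b))

docs = {
    "D1": "Hindustan Hamara Hai",
    "D2": "Hindustan Hamara jaiho",
    "D3": "ye gulistaan hai",
}
query = "Hindustan Hindustan Hai"

scores = {name: jaccard(query, text) for name, text in docs.items()}
for name, score in scores.items():
    print(name, score)
```

Exact fractions are used so the scores can be compared by hand without rounding.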

Q4. Probabilistic Information Retrieval [6 Marks]


a) The following table shows the presence or absence of terms in three documents and two
queries. Compute the odds (di Rq).
    T1  T2  T3  T4
D1   1   1   1   0
D2   0   1   0   1
D3   1   0   1   1
Q1   1   1   0   0
Q2   0   1   1   0

Q5. Evaluation of IR systems [2+3+1=6 Marks]


a) An IR system returns 8 relevant documents and 10 nonrelevant documents. There are a
total of 20 relevant documents in the collection. Calculate the precision and recall of the
system on this search.
b) Consider an information need for which there are 4 relevant documents in the collection.
Two queries are run on this collection and their results are shown below
Query1 R N R N N N N N R R
Query2 N R N N R R R N N N
Compute the Average Precision of each query and the Mean Average Precision (MAP).
c) If the MAP of one search engine is higher than that of another, what does it indicate? What
does it say about what is important in getting a good MAP score?
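Study note for parts a) and b): a minimal sketch of the standard definitions. It assumes that precision is relevant retrieved over all retrieved, that recall is relevant retrieved over all relevant, and that Average Precision sums the precision at each rank holding a relevant document and divides by the total number of relevant documents in the collection. The function names are hypothetical.

```python
from fractions import Fraction

def precision_recall(relevant_retrieved, nonrelevant_retrieved, total_relevant):
    """Return (precision, recall) as exact fractions."""
    retrieved = relevant_retrieved + nonrelevant_retrieved
    precision = Fraction(relevant_retrieved, retrieved)
    recall = Fraction(relevant_retrieved, total_relevant)
    return precision, recall

def average_precision(ranking, total_relevant):
    """Average Precision for a ranking written as space-separated R/N labels."""
    hits = 0
    precision_sum = Fraction(0)
    for rank, label in enumerate(ranking.split(), start=1):
        if label == "R":
            hits += 1
            precision_sum += Fraction(hits, rank)  # precision at this rank
    return precision_sum / total_relevant

precision, recall = precision_recall(8, 10, 20)

ap1 = average_precision("R N R N N N N N R R", 4)
ap2 = average_precision("N R N N R R R N N N", 4)
mean_ap = (ap1 + ap2) / 2
print(precision, recall, ap1, ap2, mean_ap)
```

Dividing by the total number of relevant documents, rather than by the number retrieved, means a relevant document that is never returned contributes a precision of zero.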

Q6. Language Models and CLIR [4+2+6=12 Marks]


a) Consider the following sentence: SILLY SILLY SAT ON A WALL. Assume a unigram
language model and calculate the probability of generating this sentence. Assume that your
machine can generate only these words and that all word probabilities are equal.
b) What is meant by query translation in cross-language information retrieval?
c) Given the following parallel corpus, compute the IBM Model 1 parameters for two iterations
of the EM algorithm.

    e            f
S1  green house  hara ghar
S2  this house   yeah ghar
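Study note for part c): a rough sketch of EM for IBM Model 1 on this corpus. The question does not fix several details, so the sketch assumes the following: the parameters are t(f|e), no NULL word is added on the e side, and t(f|e) starts uniform at 1 over the number of distinct f words. If the class used the other direction or a NULL token, the numbers will differ. The function name `ibm_model1` is hypothetical.

```python
from collections import defaultdict

corpus = [
    ("green house".split(), "hara ghar".split()),
    ("this house".split(), "yeah ghar".split()),
]

def ibm_model1(corpus, iterations):
    """Estimate t(f|e) with EM; returns a dict keyed by (f, e) pairs."""
    f_vocab = {f for _, fs in corpus for f in fs}
    # Uniform start for every (f, e) pair that co-occurs in some sentence pair.
    t = {(f, e): 1.0 / len(f_vocab)
         for es, fs in corpus for f in fs for e in es}
    for _ in range(iterations):
        count = defaultdict(float)  # expected count of f aligned to e
        total = defaultdict(float)  # expected count of e aligned to anything
        # E-step: distribute each f word over the e words of its sentence.
        for es, fs in corpus:
            for f in fs:
                norm = sum(t[(f, e)] for e in es)
                for e in es:
                    delta = t[(f, e)] / norm
                    count[(f, e)] += delta
                    total[e] += delta
        # M-step: renormalise the expected counts per e word.
        for (f, e), c in count.items():
            t[(f, e)] = c / total[e]
    return t

t = ibm_model1(corpus, iterations=2)
for (f, e), prob in sorted(t.items()):
    print(f"t({f}|{e}) = {prob:.4f}")
```

Pairs that never co-occur in a sentence pair are absent from the returned dictionary and can be read as probability zero.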
