Unit 6

1. What are the different techniques of document indexing?
2. Compare data retrieval and information retrieval.
3. What are an inverted index and a signature file?
4. Note on indexing of documents.
5. Why is an effective index structure important?

6. What is a web crawler?
Ans: Web crawlers are programs that locate and gather information on the web. They recursively follow hyperlinks present in known documents to find other documents. A crawler retrieves each document and adds the information found in it to a combined index. The document itself is generally not stored, although some search engines do cache a copy of the document to give clients faster access.

7. Describe a web search engine.
Ans: Since the number of documents on the web is very large, it is not possible to crawl the whole web in a short period of time. In fact, all search engines cover only some portions of the web, not all of it, and their crawlers may take weeks or months to perform a single crawl of all the pages they cover. There are usually many processes, running on multiple machines, involved in crawling. A database stores a set of links (or sites) to be crawled and assigns links from this set to each crawler process. New links found during a crawl are added to the database, and may be crawled later if they are not crawled immediately. Pages found during a crawl are also handed over to an indexing system, which may be running on a different machine. Pages have to be refetched (that is, links recrawled) periodically to obtain updated information and to discard sites that no longer exist, so that the information in the search index is kept reasonably up to date.
The indexing system itself runs on multiple machines in parallel. It is not a good idea to add pages to the same index that is being used for queries, since doing so would require concurrency control on the index and would affect query and update performance. Instead, one copy of the index is used to answer queries while another copy is updated with newly crawled pages. At periodic intervals the copies switch over, with the old one being updated while the new copy is being used for queries.

8. What is a question answering system?
Ans: Question answering systems attempt to provide direct answers to questions posed by users. Systems targeted at information on the web typically generate one or more keyword queries from a submitted question, execute the keyword queries against web search engines, and parse the returned documents to find the parts that answer the question.

9. Describe the distinct ways a user can find information on the web.
Ans: 1) Information extraction. 2) Querying structured data. 3) Question answering.

10. What do web search engines do? Describe in one line.
Ans: Web search engines crawl the web to find pages, analyze them to compute prestige measures, and index them.

Q1:- WHAT ARE SYNONYMS?
ANS1:- Synonyms are words that have the same meaning but a different representation.

Q2:- WHAT ARE HOMONYMS?
ANS2:- Homonyms are words that have the same pronunciation but different meanings.

Q3:- WHAT ARE ONTOLOGIES?
ANS3:- Using ontologies is a way to overcome the limitations of keyword-based search.

Q4:- EXPLAIN SYNONYMS WITH THE HELP OF AN EXAMPLE.
ANS4:- Synonyms are a collection of words that have the same meaning but a different representation, e.g. "motorcycle repair" = "motorcycle maintenance".

Q5:- EXPLAIN HOMONYMS WITH THE HELP OF AN EXAMPLE.
ANS5:- Homonyms are a collection of words that have different meanings but the same pronunciation, e.g. "hair" and "hare".

Q.1 How is relevance ranking calculated using TF?
A. We use the frequency of occurrence of the term in the document (that is, how many times that particular term has occurred) as a measure of its relevance. One way of measuring TF(d,t), the term frequency, or the relevance of the document d to a term t, is
TF(d,t) = log(1 + n(d,t)/n(d))
where n(d,t) is the number of occurrences of term t in document d and n(d) is the number of terms in d.

Q.2 What is the use of an Information Retrieval System?
A. An Information Retrieval System is intended to support people who are actively seeking or searching for information, as in internet searching. Information retrieval typically assumes a static or relatively static database, against which people search.

Q.3 Explain the Similarity-based Retrieval System.
A. Similarity-based retrieval relies on best match rather than exact match and uses techniques to compute the similarities between the query and information items. As the user's information needs are also fuzzy, an important characteristic of this class of retrieval technique is its support for the iterative process of retrieval.

Q.4 What is cosine similarity?

Q.5 What is the use of a similarity-based Retrieval System?

Q.1. What is the difference between Information Retrieval and Data Retrieval?
A. 1. A Data Retrieval System gives an exact match of the search elements, whereas an Information Retrieval System gives partial or best-match results.
2. The query language in a Data Retrieval System is artificial, whereas natural language is used in an Information Retrieval System.
3. Complete query specification is required in a Data Retrieval System, whereas partial query specification works in an Information Retrieval System.
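The term-frequency formula given earlier in the answer on relevance ranking, TF(d,t) = log(1 + n(d,t)/n(d)), can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `term_frequency` and `rank_by_tf` are hypothetical, documents are assumed to be already split into lists of terms, and the natural logarithm is used because the source does not say which base is meant.

```python
import math
from collections import Counter


def term_frequency(document_terms, term):
    """TF(d,t) = log(1 + n(d,t)/n(d)); natural log is an assumption."""
    n_d = len(document_terms)              # n(d): number of terms in the document
    if n_d == 0:
        return 0.0                         # an empty document has no relevance
    n_dt = Counter(document_terms)[term]   # n(d,t): occurrences of the term
    return math.log(1 + n_dt / n_d)


def rank_by_tf(documents, term):
    """Return document positions ordered from most to least relevant."""
    scores = [term_frequency(doc, term) for doc in documents]
    return sorted(range(len(documents)), key=lambda i: scores[i], reverse=True)


docs = [
    ["web", "crawler"],
    ["web", "web", "index"],
    ["query"],
]
ranking = rank_by_tf(docs, "web")
```

Because of the logarithm, a document that mentions the term twice as often is scored higher, but not twice as high.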

Q.2. Explain the components of an Information Retrieval System.
A. The typical components of an Information Retrieval System are: 1. Input 2. Processor 3. Output

Q.3. What is relevance? How is it calculated?
A. Relevance can be calculated as the cosine of the angle between the two vectors, i.e. their dot product divided by the product of their lengths, where the length of a vector is the square root of the sum of the squares of its components. For non-negative term weights this measure varies between 0 and 1.

Q.4. What is TF-IDF? (Term Frequency * Inverse Document Frequency)
A. A measure of the frequency of occurrence of a particular term in a particular document as well as how often that term occurs in the entire collection of interest.

Q.5. How is TF * IDF used? What is the need?
A. If a term occurs frequently in one document but also occurs frequently in every other document in the collection, then it is not a very important word and the TF-IDF measure reduces the weight placed on it. A common term is considered less important than a rare term. If a term occurs in every document then the inverse document frequency is zero. If it occurs in half of the documents, it will be 0.3, and if it occurs in 20 of 10000 documents, it will be about 2.7. These values correspond to the base-10 logarithm of the total number of documents divided by the number of documents containing the term.

Q.6. Illustrate the components of an Information Retrieval System using a diagram.

Q.7. An Information Retrieval System is best match or partial match, whereas a Data Retrieval System is exact match. Expand.
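The two calculations described in the answers above, cosine relevance and inverse document frequency, can be checked with a short Python sketch. The function names `idf` and `cosine_similarity` are hypothetical, and the base-10 logarithm of (total documents / documents containing the term) is an assumption inferred from the figures quoted in the answer, not something the source states as a formula.

```python
import math


def idf(total_docs, docs_with_term):
    """Inverse document frequency as log10(N / df); the base is an assumption."""
    return math.log10(total_docs / docs_with_term)


def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0                      # treat an all-zero vector as unrelated
    return dot / (norm_u * norm_v)


# The three cases mentioned in the answer on TF * IDF.
in_every_doc = idf(10000, 10000)
in_half_of_docs = idf(10000, 5000)
in_20_of_10000 = idf(10000, 20)

# Two term-weight vectors with no terms in common.
unrelated = cosine_similarity([1, 0], [0, 1])
```

Under this assumption the three IDF values come out as 0, roughly 0.3 and roughly 2.7. Vectors that point in the same direction give a cosine of 1, and vectors with no shared terms give 0.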
