1. Introduction
Information retrieval is the activity of obtaining information resources relevant to an information need from a collection of information resources. Searches can be based on metadata or on full-text (or other content-based) indexing. Automated information retrieval systems are used to reduce what has been called "information overload".
Many universities and public libraries use IR systems to provide access to books, journals and other documents. Web search engines are the most visible IR applications.
The problem of IR
The goal is to find documents relevant to an information need.
[Figure: a query is posed to the IR system, which retrieves an answer list from the document collection; the results are continuously refined. Example: Web search.]
2. Conceptual Models of IR
An IR conceptual model is a general approach to IR
systems. Faloutsos (1985) gives three basic approaches: text pattern search, inverted file search, and signature search. Belkin and Croft (1987) categorize IR conceptual models differently. They divide retrieval techniques first into exact match and inexact match.
The exact match category contains text pattern
search and Boolean search techniques. Text pattern search queries are strings or regular expressions. Text pattern systems are more common for searching small collections, such as personal collections of files. In a Boolean IR System, documents are represented by sets of keywords, usually stored in an inverted file.
The inexact match category includes such techniques as probabilistic, vector space, and clustering models, among others. It is possible to assign a probability of relevance to each document in a retrieved set, allowing retrieved documents to be ranked in order of probable relevance. It is also possible to group (cluster) documents based on the terms that they contain and to retrieve from these groups using a ranking methodology.
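The ranking idea behind the vector space model can be sketched very simply. The following is a minimal, illustrative example (the document texts and scoring are purely illustrative; a real system would use weighted term vectors such as tf-idf):

```python
import math
from collections import Counter

def cosine_score(query_terms, doc_terms):
    """Cosine similarity of raw term-frequency vectors:
    a minimal stand-in for a full vector-space model."""
    q, d = Counter(query_terms), Counter(doc_terms)
    dot = sum(q[t] * d[t] for t in q)
    norm = (math.sqrt(sum(v * v for v in q.values()))
            * math.sqrt(sum(v * v for v in d.values())))
    return dot / norm if norm else 0.0

docs = {
    "d1": "the caesar was ambitious".split(),
    "d2": "brutus killed caesar".split(),
    "d3": "the weather is mild".split(),
}
query = "caesar brutus".split()

# Rank documents by decreasing similarity to the query.
ranked = sorted(docs, key=lambda d: cosine_score(query, docs[d]), reverse=True)
```

Here d2 ranks highest because it shares both query terms, while d3, sharing none, scores zero; this is the ranked retrieval that Boolean (exact match) systems cannot provide.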
3. File Structures
A fundamental decision in the design of IR systems is which
file structure to use for the underlying document database. Although a small database could be held in main memory, in practice IR databases are usually stored on disk because of their size.
<field-ID>: a unique name that indicates from which field in the document the keyword came.
Signature files use bit patterns (signatures) to represent documents. Signature method: documents are split into logical blocks, each containing a fixed number of distinct significant (that is, non-stoplist) words. Each word in the block is hashed to give a signature, a bit pattern with some of the bits set to 1. The signatures of all the words in a block are ORed together to create a block signature. The block signatures are then concatenated to produce the document signature. Searching is done by comparing the signatures of queries with document signatures.
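The hashing-and-OR scheme above can be sketched in a few lines. This is only an illustration of the idea: the signature width, number of bits per word, and use of Python's built-in hash() are arbitrary choices for brevity (a real system would use a stable hash function):

```python
def word_signature(word, width=16, bits_per_word=2):
    """Hash a word to a signature: a bit pattern with a few bits set to 1."""
    sig = 0
    for i in range(bits_per_word):
        sig |= 1 << (hash((word, i)) % width)
    return sig

def block_signature(words, width=16):
    """OR together the signatures of every word in a block."""
    sig = 0
    for w in words:
        sig |= word_signature(w, width)
    return sig

def may_contain(block_sig, query_word, width=16):
    """The block may contain the word if all of the word's bits are set.
    False positives are possible; false negatives are not."""
    q = word_signature(query_word, width)
    return block_sig & q == q
```

Because signatures only superimpose bit patterns, a match means a block *may* contain the query word and must be verified against the text; a non-match is definitive.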
Graphs can also represent documents, as nodes connected by arcs. For example, a kind of graph called a semantic net can be used to represent the semantic relationships in text that are often lost in the indexing systems above. Graph-based techniques for IR are impractical now because of the amount of manual effort that would be needed to represent a large document collection in this form.
4. Query Operations
Queries are formal statements of information needs put to
the IR system by users. The operations on queries are obviously a function of the type of query and the capabilities of the IR system. One common query operation is parsing, that is, breaking the query into its constituent elements. Boolean queries, for example, must be parsed into their constituent terms and operators. The set of document identifiers associated with each query term is retrieved, and the sets are then combined according to the Boolean operators.
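The step of combining posting sets according to the Boolean operators can be sketched with Python sets (the postings data here is purely illustrative):

```python
# Each term's set of document identifiers, as retrieved from the index.
postings = {
    "brutus":    {1, 2, 4},
    "caesar":    {1, 2, 4, 5, 6},
    "calpurnia": {2},
}
all_docs = {1, 2, 3, 4, 5, 6}  # universe of document identifiers

# Evaluate: brutus AND caesar AND NOT calpurnia
# AND maps to set intersection; NOT maps to complement against all_docs.
result = postings["brutus"] & postings["caesar"] & (all_docs - postings["calpurnia"])
```

Intersection and complement on identifier sets are exactly the combination step the parsed Boolean operators drive.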
A small example shows how this works, and introduces the basics of the Boolean retrieval model.
Suppose we record for each document (here, a play of Shakespeare's) whether it contains each word out of all the words Shakespeare used (Shakespeare used about 32,000 different words).
Reading the resulting term-document matrix by rows or columns, we can have a vector for each term, which shows the documents it appears in, or a vector for each document, showing the terms that occur in it.
To answer the query Brutus AND Caesar AND NOT Calpurnia, we take the vectors for Brutus, Caesar and Calpurnia, complement the last, and then do a bitwise AND: 110100 AND 110111 AND 101111 = 100100
The answers for this query are thus Antony and Cleopatra and Hamlet.
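The bitwise evaluation above can be reproduced directly with Python integers. The ordering of the six plays is an assumption matching the example's result (the incidence vectors themselves come from the example):

```python
# One bit per play, most significant bit first (assumed ordering):
plays = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
         "Hamlet", "Othello", "Macbeth"]

vec = {
    "Brutus":    0b110100,
    "Caesar":    0b110111,
    "Calpurnia": 0b010000,
}
mask = 0b111111  # six plays

# Brutus AND Caesar AND NOT Calpurnia:
# complement Calpurnia's vector (within the mask), then bitwise AND.
answer = vec["Brutus"] & vec["Caesar"] & (~vec["Calpurnia"] & mask)
# 0b110100 AND 0b110111 AND 0b101111 == 0b100100

matches = [plays[i] for i in range(6) if answer & (1 << (5 - i))]
```

The resulting bit pattern 100100 selects the first and fourth plays, matching the answer stated above.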
5. Term Operations
Operations on terms in an IR system include stemming, truncation, weighting, and stoplist and thesaurus operations. Stemming is the automated conflation (fusing or combining) of related words, usually by reducing the words to a common root form. For example, take, taken, and taking would all result in take; walk, walking, and walked in walk; computation and computing in compute.
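A toy suffix-stripping stemmer illustrates the conflation idea. The rule list here is purely illustrative; real systems use a full algorithm such as the Porter stemmer, which handles many more cases correctly:

```python
# Suffixes tried in order; longest rules first so "ation" beats "s".
SUFFIXES = ["ation", "ing", "ed", "en", "s"]

def stem(word):
    """Strip the first matching suffix, keeping at least a 3-letter stem."""
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            return word[:-len(suf)]
    return word
```

Note that naive stripping conflates walk, walking, and walked to "walk", but reduces computation and computing only to the artificial root "comput"; mapping such roots back to dictionary words is what distinguishes full stemming/lemmatization algorithms from this sketch.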
Truncation is the placing of wildcard characters in the word, so that the truncated term will match multiple words. Truncation allows you to search for various word endings and spellings simultaneously, retrieving results with all the different endings of that root word.
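Wildcard matching of this kind is easy to demonstrate with the standard library's fnmatch module (the vocabulary list is illustrative):

```python
from fnmatch import fnmatch

vocabulary = ["computation", "computing", "computer", "commuter", "walk"]

# The truncated term comput* matches every word sharing the root "comput".
matches = [w for w in vocabulary if fnmatch(w, "comput*")]
```

In a real IR system the pattern would be expanded against the index vocabulary and the resulting terms ORed together as a single query.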
Thesaurus operations use a thesaurus, which lists synonymous terms and sometimes the relationships among them.
A stoplist is a list of words considered to have no
indexing value, used to eliminate potential indexing terms. Each potential indexing term is checked against the stoplist and eliminated if found there.
Example stop words: doesn't, doing, don't, during, each, else, every, it's, its, itself, just, know, most, name, need, rather, said, same, there, under, using, very.
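Checking each potential indexing term against the stoplist is a one-line filter; a minimal sketch using the sample words above:

```python
# Stoplist drawn from the sample words above (a real list is much longer).
STOPLIST = {"doesn't", "doing", "don't", "during", "each", "else", "every",
            "it's", "its", "itself", "just", "know", "most", "name", "need",
            "rather", "said", "same", "there", "under", "using", "very"}

def index_terms(words):
    """Keep only terms not found on the stoplist."""
    return [w for w in words if w.lower() not in STOPLIST]

terms = index_terms("there is very little doubt during indexing".split())
```

Only the non-stoplist words survive as candidate indexing terms.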
6. Document Operations
Documents are the primary objects in IR systems. Documents added to the
database must be given unique identifiers, parsed into their constituent fields, and those fields broken into field identifiers and terms.
It is possible to assign a probability of relevance to each document in a retrieved set, allowing retrieved documents to be ranked in order of probable relevance.
Term distribution information can also be used to rank retrieved documents.
7. Hardware for IR
Hardware affects the design of IR systems because it
determines, in part, the operating speed of an IR system, a crucial factor in interactive information systems.
Along with the need for greater speed, there is the
need for storage media capable of compactly holding the huge document databases that have proliferated.
Statistics on the distribution of words in documents and in the database as a whole are often used for ranking retrieved documents.
Finally, the words and associated information such
as the documents, fields within the documents, and counts are put into the database.
The index maps keywords to document and field identifiers as follows:

keyword1 - document1-Field_2
keyword2 - document1-Field_2, 5
keyword2 - document3-Field_1, 2
keyword3 - document3-Field_3, 4
keyword-n - document-n-Field_i, j

Such a structure is what we have already talked about: an inverted file. In an IR system, each document must have a unique identifier, and its fields, if field operations are supported, must have unique field names.
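Building such an inverted file can be sketched in a few lines; the document and field names below mirror the layout above but are otherwise illustrative:

```python
from collections import defaultdict

def build_inverted_file(docs):
    """Map each keyword to the set of (document-ID, field-ID) pairs
    in which it occurs."""
    index = defaultdict(set)
    for doc_id, fields in docs.items():
        for field_id, text in fields.items():
            for keyword in text.lower().split():
                index[keyword].add((doc_id, field_id))
    return index

docs = {
    "document1": {"Field_2": "brutus killed caesar"},
    "document3": {"Field_1": "caesar was ambitious"},
}
index = build_inverted_file(docs)
```

Looking up a keyword now returns every document and field containing it, which is exactly the access pattern Boolean retrieval needs. (Word positions or counts, shown as trailing numbers in the layout above, could be stored alongside each pair.)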
IR Evaluation Criteria
Effectiveness
Efficiency
Usability
IR Effectiveness Evaluation
User-centered strategy: given several users and at least two retrieval systems, have each user try the same task on both systems, and measure which system works best.
System-centered strategy: given documents, queries, and relevance judgments, try several variations on the retrieval system, and measure which ranks more good documents near the top.
[Figure: example ranked result lists for several systems (B-F); R marks a relevant document.]
Defining Relevance
Relevance relates a topic and a document: duplicates are equally relevant by definition; it is constant over time and across users; it may include concerns such as timeliness, authority, or novelty of the result.
Pertinence relates a task and a document: it accounts for quality, complexity, language.
Utility relates a user and a document.
Another View
[Figure: the space of all documents, with overlapping Relevant and Retrieved sets; their intersection is Relevant + Retrieved.]
Effectiveness Measures
Action          Relevant              Not relevant
Retrieved       Relevant Retrieved    False Alarm
Not Retrieved   Miss                  Irrelevant Rejected
User-Oriented
Hits 1-10 (R = relevant document; 14 relevant documents in the collection):

Hit   Precision   Recall
1     1/1         1/14
2     1/2         1/14
3     1/3         1/14
4     1/4         1/14
5     2/5         2/14
6     3/6         3/14
7     3/7         3/14
8     4/8         4/14
9     4/9         4/14
10    4/10        4/14
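The precision and recall values at each rank can be recomputed directly from the list of relevance judgments (the 0/1 pattern below is chosen to reproduce the table above):

```python
from fractions import Fraction

# 1 marks a relevant hit among the first ten results;
# 14 relevant documents exist in the whole collection.
hits = [1, 0, 0, 0, 1, 1, 0, 1, 0, 0]
TOTAL_RELEVANT = 14

rows = []
found = 0
for k, rel in enumerate(hits, start=1):
    found += rel
    # precision@k = relevant found so far / k
    # recall@k    = relevant found so far / total relevant in collection
    rows.append((k, Fraction(found, k), Fraction(found, TOTAL_RELEVANT)))
```

Precision falls whenever an irrelevant document is retrieved, while recall can only rise; this trade-off is what recall-precision curves plot.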
FAR, FRR
The false acceptance rate, or FAR, is the measure of
the likelihood that the system will incorrectly retrieve an irrelevant document. A system's FAR is typically stated as the number of false retrievals divided by the number of retrievals made.
The false rejection rate, or FRR, is the measure of the
likelihood that the system will incorrectly fail to retrieve a relevant document. A system's FRR is typically stated as the number of false rejections divided by the number of retrievals made.
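Both rates are simple ratios; a minimal sketch using the definitions above (the counts are illustrative):

```python
def far(false_retrievals, retrievals):
    """False acceptance rate: irrelevant documents retrieved / retrievals made."""
    return false_retrievals / retrievals

def frr(false_rejections, retrievals):
    """False rejection rate: relevant documents missed / retrievals made."""
    return false_rejections / retrievals

# Example: 20 retrievals, of which 5 were irrelevant; 2 relevant documents missed.
example_far = far(5, 20)
example_frr = frr(2, 20)
```

Tuning a system typically trades one rate against the other: retrieving more aggressively lowers FRR but raises FAR, and vice versa.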
Thank You.