1. Introduction
Information retrieval is the activity of obtaining information resources relevant to an information need from a collection of information resources. Searches can be based on metadata or on full-text (or other content-based) indexing. Automated information retrieval systems are used to reduce what has been called "information overload".
Many universities and public libraries use IR systems to provide access to books, journals and other documents. Web search engines are the most visible IR applications.
The problem of IR
The goal is to find documents relevant to an information need.
[Figure: a query is posed to the IR system, which retrieves an answer list from the document collection; the results are continuously refined. Example: Web search.]
2. Conceptual Models of IR
An IR conceptual model is a general approach to IR
systems. Faloutsos (1985) gives three basic approaches: text pattern search, inverted file search, and signature search. Belkin and Croft (1987) categorize IR conceptual models differently. They divide retrieval techniques first into exact match and inexact match.
The exact match category contains text pattern
search and Boolean search techniques. Text pattern search queries are strings or regular expressions. Text pattern systems are more common for searching small collections, such as personal collections of files. In a Boolean IR System, documents are represented by sets of keywords, usually stored in an inverted file.
The inexact match category includes such techniques as probabilistic, vector space, and clustering models, among others. It is possible to assign a probability of relevance to each document in a retrieved set, allowing retrieved documents to be ranked in order of probable relevance. It is also possible to group (cluster) documents based on the terms that they contain and to retrieve from these groups using a ranking methodology.
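The ranking idea behind the vector space model can be sketched very simply. The following is a minimal, illustrative example (the document texts and scoring are purely illustrative; a real system would use weighted term vectors such as tf-idf):

```python
import math
from collections import Counter

def cosine_score(query_terms, doc_terms):
    """Cosine similarity of raw term-frequency vectors:
    a minimal stand-in for a full vector-space model."""
    q, d = Counter(query_terms), Counter(doc_terms)
    dot = sum(q[t] * d[t] for t in q)
    norm = (math.sqrt(sum(v * v for v in q.values()))
            * math.sqrt(sum(v * v for v in d.values())))
    return dot / norm if norm else 0.0

docs = {
    "d1": "the caesar was ambitious".split(),
    "d2": "brutus killed caesar".split(),
    "d3": "the weather is mild".split(),
}
query = "caesar brutus".split()

# Rank documents by decreasing similarity to the query.
ranked = sorted(docs, key=lambda d: cosine_score(query, docs[d]), reverse=True)
```

Here d2 ranks highest because it shares both query terms, while d3, sharing none, scores zero; this is the ranked retrieval that Boolean (exact match) systems cannot provide.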
3. File Structures
A fundamental decision in the design of IR systems is which
file structure to use for the underlying document database. Although a small database could be held in main memory, in practice IR databases are usually stored on disk because of their size.
<field-ID>: a unique name that indicates from which field in the document the keyword came.
Signature files use bit patterns (signatures) to represent documents. Signature method: documents are split into logical blocks, each containing a fixed number of distinct significant (that is, non-stoplist) words. Each word in the block is hashed to give a signature, a bit pattern with some of the bits set to 1. The signatures of all the words in a block are ORed together to create a block signature. The block signatures are then concatenated to produce the document signature. Searching is done by comparing the signatures of queries with document signatures.
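The hashing-and-OR scheme above can be sketched in a few lines. This is only an illustration of the idea: the signature width, number of bits per word, and use of Python's built-in hash() are arbitrary choices for brevity (a real system would use a stable hash function):

```python
def word_signature(word, width=16, bits_per_word=2):
    """Hash a word to a signature: a bit pattern with a few bits set to 1."""
    sig = 0
    for i in range(bits_per_word):
        sig |= 1 << (hash((word, i)) % width)
    return sig

def block_signature(words, width=16):
    """OR together the signatures of every word in a block."""
    sig = 0
    for w in words:
        sig |= word_signature(w, width)
    return sig

def may_contain(block_sig, query_word, width=16):
    """The block may contain the word if all of the word's bits are set.
    False positives are possible; false negatives are not."""
    q = word_signature(query_word, width)
    return block_sig & q == q
```

Because signatures only superimpose bit patterns, a match means a block *may* contain the query word and must be verified against the text; a non-match is definitive.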
Graphs can also represent documents, as nodes connected by arcs. For example, a kind of graph called a semantic net can be used to represent the semantic relationships in text that are often lost in the indexing systems above. Graph-based techniques for IR are impractical now because of the amount of manual effort that would be needed to represent a large document collection in this form.
4. Query Operations
Queries are formal statements of information needs put to
the IR system by users. The operations on queries are obviously a function of the type of query and the capabilities of the IR system. One common query operation is parsing, that is, breaking the query into its constituent elements. Boolean queries, for example, must be parsed into their constituent terms and operators. The set of document identifiers associated with each query term is retrieved, and the sets are then combined according to the Boolean operators.
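The step of combining posting sets according to the Boolean operators can be sketched with Python sets (the postings data here is purely illustrative):

```python
# Each term's set of document identifiers, as retrieved from the index.
postings = {
    "brutus":    {1, 2, 4},
    "caesar":    {1, 2, 4, 5, 6},
    "calpurnia": {2},
}
all_docs = {1, 2, 3, 4, 5, 6}  # universe of document identifiers

# Evaluate: brutus AND caesar AND NOT calpurnia
# AND maps to set intersection; NOT maps to complement against all_docs.
result = postings["brutus"] & postings["caesar"] & (all_docs - postings["calpurnia"])
```

Intersection and complement on identifier sets are exactly the combination step the parsed Boolean operators drive.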
A small example shows how this works, and introduces the basics of the Boolean retrieval model.
Suppose we record for each document (here, a play of Shakespeare's) whether it contains each word out of all the words Shakespeare used (Shakespeare used about 32,000 different words).
Reading the resulting term-document matrix by rows or columns, we can have a vector for each term, which shows the documents it appears in, or a vector for each document, showing the terms that occur in it.
To answer the query Brutus AND Caesar AND NOT Calpurnia, we take the vectors for Brutus, Caesar and Calpurnia, complement the last, and then do a bitwise AND: 110100 AND 110111 AND 101111 = 100100
The answers for this query are thus Antony and Cleopatra and Hamlet.
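The bitwise evaluation above can be reproduced directly with Python integers. The ordering of the six plays is an assumption matching the example's result (the incidence vectors themselves come from the example):

```python
# One bit per play, most significant bit first (assumed ordering):
plays = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
         "Hamlet", "Othello", "Macbeth"]

vec = {
    "Brutus":    0b110100,
    "Caesar":    0b110111,
    "Calpurnia": 0b010000,
}
mask = 0b111111  # six plays

# Brutus AND Caesar AND NOT Calpurnia:
# complement Calpurnia's vector (within the mask), then bitwise AND.
answer = vec["Brutus"] & vec["Caesar"] & (~vec["Calpurnia"] & mask)
# 0b110100 AND 0b110111 AND 0b101111 == 0b100100

matches = [plays[i] for i in range(6) if answer & (1 << (5 - i))]
```

The resulting bit pattern 100100 selects the first and fourth plays, matching the answer stated above.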
5. Term Operations
Operations on terms in an IR system include stemming, truncation, weighting, and stoplist and thesaurus operations. Stemming is the automated conflation (fusing or combining) of related words, usually by reducing the words to a common root form. For example, take, taken, and taking would all result in take; walk, walking, and walked in walk; computation and computing in compute.
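A toy suffix-stripping stemmer illustrates the conflation idea. The rule list here is purely illustrative; real systems use a full algorithm such as the Porter stemmer, which handles many more cases correctly:

```python
# Suffixes tried in order; longest rules first so "ation" beats "s".
SUFFIXES = ["ation", "ing", "ed", "en", "s"]

def stem(word):
    """Strip the first matching suffix, keeping at least a 3-letter stem."""
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            return word[:-len(suf)]
    return word
```

Note that naive stripping conflates walk, walking, and walked to "walk", but reduces computation and computing only to the artificial root "comput"; mapping such roots back to dictionary words is what distinguishes full stemming/lemmatization algorithms from this sketch.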
Truncation is the placing of wildcard characters in the word, so that the truncated term will match multiple words. Truncation allows you to search for various word endings and spellings simultaneously, retrieving results with all the different endings of that root word.
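Wildcard matching of this kind is easy to demonstrate with the standard library's fnmatch module (the vocabulary list is illustrative):

```python
from fnmatch import fnmatch

vocabulary = ["computation", "computing", "computer", "commuter", "walk"]

# The truncated term comput* matches every word sharing the root "comput".
matches = [w for w in vocabulary if fnmatch(w, "comput*")]
```

In a real IR system the pattern would be expanded against the index vocabulary and the resulting terms ORed together as a single query.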
Thesaurus operations use a thesaurus, which lists synonymous terms and sometimes the relationships among them.
A stoplist is a list of words considered to have no
indexing value, used to eliminate potential indexing terms. Each potential indexing term is checked against the stoplist and eliminated if found there.
Example stop words: doesn't, doing, don't, during, each, else, every, it's, its, itself, just, know, most, name, need, rather, said, same, there, under, using, very.
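Checking each potential indexing term against the stoplist is a one-line filter; a minimal sketch using the sample words above:

```python
# Stoplist drawn from the sample words above (a real list is much longer).
STOPLIST = {"doesn't", "doing", "don't", "during", "each", "else", "every",
            "it's", "its", "itself", "just", "know", "most", "name", "need",
            "rather", "said", "same", "there", "under", "using", "very"}

def index_terms(words):
    """Keep only terms not found on the stoplist."""
    return [w for w in words if w.lower() not in STOPLIST]

terms = index_terms("there is very little doubt during indexing".split())
```

Only the non-stoplist words survive as candidate indexing terms.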
6. Document Operations
Documents are the primary objects in IR systems. Documents added to the
database must be given unique identifiers, parsed into their constituent fields, and those fields broken into field identifiers and terms.
It is possible to assign a probability of relevance to each document in a retrieved set, allowing retrieved documents to be ranked in order of probable relevance.
Term distribution information can also be used to rank retrieved documents.
7. Hardware for IR
Hardware affects the design of IR systems because it
determines, in part, the operating speed of an IR system, a crucial factor in interactive information systems.
Along with the need for greater speed, there is the
need for storage media capable of compactly holding the huge document databases that have proliferated.
Statistics on the distribution of words in documents and in the database as a whole are often used for ranking retrieved documents.
Finally, the words and associated information such
as the documents, fields within the documents, and counts are put into the database.
The index maps keywords to document and field identifiers as follows:

keyword1 - document1-Field_2
keyword2 - document1-Field_2, 5
keyword2 - document3-Field_1, 2
keyword3 - document3-Field_3, 4
keyword-n - document-n-Field_i, j

Such a structure is what we have already talked about: an inverted file. In an IR system, each document must have a unique identifier, and its fields, if field operations are supported, must have unique field names.
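Building such an inverted file can be sketched in a few lines; the document and field names below mirror the layout above but are otherwise illustrative:

```python
from collections import defaultdict

def build_inverted_file(docs):
    """Map each keyword to the set of (document-ID, field-ID) pairs
    in which it occurs."""
    index = defaultdict(set)
    for doc_id, fields in docs.items():
        for field_id, text in fields.items():
            for keyword in text.lower().split():
                index[keyword].add((doc_id, field_id))
    return index

docs = {
    "document1": {"Field_2": "brutus killed caesar"},
    "document3": {"Field_1": "caesar was ambitious"},
}
index = build_inverted_file(docs)
```

Looking up a keyword now returns every document and field containing it, which is exactly the access pattern Boolean retrieval needs. (Word positions or counts, shown as trailing numbers in the layout above, could be stored alongside each pair.)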
IR Evaluation Criteria
Effectiveness
Efficiency
Usability
IR Effectiveness Evaluation
User-centered strategy: given several users and at least two retrieval systems, have each user try the same task on both systems, and measure which system works best.
System-centered strategy: given documents, queries, and relevance judgments, try several variations on the retrieval system, and measure which ranks more good documents near the top.
[Figure: example ranked result lists for several systems (B-F); R marks a relevant document.]
Defining Relevance
Relevance relates a topic and a document: duplicates are equally relevant by definition; it is constant over time and across users; it may include concerns such as timeliness, authority, or novelty of the result.
Pertinence relates a task and a document: it accounts for quality, complexity, language.
Utility relates a user and a document.
Another View
[Figure: the space of all documents, with overlapping Relevant and Retrieved sets; their intersection is Relevant + Retrieved.]
Effectiveness Measures
Action          Relevant              Not relevant
Retrieved       Relevant Retrieved    False Alarm
Not Retrieved   Miss                  Irrelevant Rejected
User-Oriented
Hits 1-10 (R = relevant document; 14 relevant documents in the collection):

Hit   Precision   Recall
1     1/1         1/14
2     1/2         1/14
3     1/3         1/14
4     1/4         1/14
5     2/5         2/14
6     3/6         3/14
7     3/7         3/14
8     4/8         4/14
9     4/9         4/14
10    4/10        4/14
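The precision and recall values at each rank can be recomputed directly from the list of relevance judgments (the 0/1 pattern below is chosen to reproduce the table above):

```python
from fractions import Fraction

# 1 marks a relevant hit among the first ten results;
# 14 relevant documents exist in the whole collection.
hits = [1, 0, 0, 0, 1, 1, 0, 1, 0, 0]
TOTAL_RELEVANT = 14

rows = []
found = 0
for k, rel in enumerate(hits, start=1):
    found += rel
    # precision@k = relevant found so far / k
    # recall@k    = relevant found so far / total relevant in collection
    rows.append((k, Fraction(found, k), Fraction(found, TOTAL_RELEVANT)))
```

Precision falls whenever an irrelevant document is retrieved, while recall can only rise; this trade-off is what recall-precision curves plot.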
FAR, FRR
The false acceptance rate, or FAR, is the measure of
the likelihood that the system will incorrectly retrieve an irrelevant document. A system's FAR is typically stated as the number of false retrievals divided by the number of retrievals made.
The false rejection rate, or FRR, is the measure of the
likelihood that the system will incorrectly fail to retrieve a relevant document. A system's FRR is typically stated as the number of false rejections divided by the number of retrievals made.
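Both rates are simple ratios; a minimal sketch using the definitions above (the counts are illustrative):

```python
def far(false_retrievals, retrievals):
    """False acceptance rate: irrelevant documents retrieved / retrievals made."""
    return false_retrievals / retrievals

def frr(false_rejections, retrievals):
    """False rejection rate: relevant documents missed / retrievals made."""
    return false_rejections / retrievals

# Example: 20 retrievals, of which 5 were irrelevant; 2 relevant documents missed.
example_far = far(5, 20)
example_frr = frr(2, 20)
```

Tuning a system typically trades one rate against the other: retrieving more aggressively lowers FRR but raises FAR, and vice versa.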
Thank You.