
Major Information Retrieval Models

The major models developed for information retrieval are the Boolean model; the statistical
models, which include the vector space and probabilistic retrieval models; and the linguistic and
knowledge-based models. The Boolean model is often referred to as the "exact match" model, and
the others as the "best match" models.
The Boolean Model
Based on set theory and Boolean algebra
Historically the most common model, used in library OPACs, the Dialog system, and many web search
engines
Documents are sets of terms
Queries are Boolean expressions on terms
Boolean logic allows a user to logically relate multiple concepts together to define what information is
needed. Typically the Boolean functions apply to processing tokens identified anywhere within an item.
The typical Boolean operators are AND, OR, and NOT. These operations are implemented using set
intersection, set union and set difference procedures.
AND requires both terms to be in each item returned. If one term is contained in the document and the
other is not, the item is not included in the resulting list. (Narrows the search.)
Example: A search on stock market AND trading returns items containing: stock market trading; trading
on the stock market; and trading on the late afternoon stock market.
OR requires either term (or both) to be in the returned document. (Broadens the search.)
Example: A search on ecology OR pollution returns documents containing the word ecology (but not
pollution), documents containing the word pollution (but not ecology), and documents containing both
ecology and pollution, in any order and with any number of occurrences.
NOT or AND NOT (depending on the database's search engine): the first term is
searched, then any records containing the term after the operator are subtracted from the results.
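The mapping from AND, OR, and NOT to set intersection, union, and difference can be sketched as follows. This is a minimal illustration, assuming a toy collection and a simple whitespace tokenizer; the document ids, the texts, and the helper names build_index and postings are invented for the example.

```python
# Toy collection; ids and texts are invented for illustration.
docs = {
    1: "stock market trading",
    2: "trading on the stock market",
    3: "ecology of wetlands",
    4: "pollution and ecology",
    5: "air pollution",
}

def build_index(docs):
    """Map each term to the set of ids of the documents that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index.setdefault(term, set()).add(doc_id)
    return index

index = build_index(docs)

def postings(term):
    """Return the set of documents containing the term (empty if unknown)."""
    return index.get(term, set())

and_result = postings("market") & postings("trading")      # AND: set intersection
or_result = postings("ecology") | postings("pollution")    # OR: set union
not_result = postings("pollution") - postings("ecology")   # AND NOT: set difference
```

AND can only shrink a result set and OR can only grow it, which is why the first narrows the search and the second broadens it.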
Parentheses help you group and order a mixture of Boolean operators, e.g. (mouse OR rat OR mice)
AND cats, or ((mouse OR rat) AND trap) OR mousetrap.
Nested parentheses: the innermost parenthetical group is processed first.
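One way to see why the innermost group is processed first is to write the query as a nested structure and evaluate it recursively. The sketch below assumes a hypothetical representation of queries as nested tuples of the form (operator, left, right) and a small hand-made index; neither comes from any particular search system.

```python
def evaluate(query, index):
    """Recursively evaluate a nested Boolean query against a term -> doc-id-set index."""
    if isinstance(query, str):           # a bare term
        return index.get(query, set())
    op, left, right = query
    left_set = evaluate(left, index)     # inner groups are resolved before the outer operator
    right_set = evaluate(right, index)
    if op == "AND":
        return left_set & right_set
    if op == "OR":
        return left_set | right_set
    if op == "NOT":                      # read as: left AND NOT right
        return left_set - right_set
    raise ValueError(f"unknown operator: {op}")

# Hand-made index for illustration only.
index = {
    "mouse": {1, 2},
    "rat": {3},
    "trap": {2, 3, 4},
    "mousetrap": {5},
}

# ((mouse OR rat) AND trap) OR mousetrap
query = ("OR", ("AND", ("OR", "mouse", "rat"), "trap"), "mousetrap")
result = evaluate(query, index)
```

The recursion reaches (mouse OR rat) before it can apply AND trap, and only then applies the outer OR.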
Proximity
Proximity operators vary by database. There are two standard types:
Near: finds words within a certain number of words of each other, regardless of word order.
Within: finds words within a certain number of words of each other, in the order you specify.
JSTOR allows you to find terms that are within a specific number of words of each other using the tilde
(~) as a proximity operator.
For example, to search for an item with the terms debt and forgiveness within ten words of each other,
you would construct the following query: "debt forgiveness"~10
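The difference between the two proximity types can be sketched with word positions. This is an illustrative sketch only: it assumes whitespace tokenization and measures distance as the difference between word positions, which may not match how JSTOR or any particular database counts words. The function names near and within are chosen to mirror the operator names above.

```python
def positions(tokens, term):
    """Return every position at which the term occurs in the token list."""
    return [i for i, tok in enumerate(tokens) if tok == term]

def near(text, a, b, k):
    """True if a and b occur within k word positions of each other, in either order."""
    tokens = text.lower().split()
    return any(abs(i - j) <= k
               for i in positions(tokens, a)
               for j in positions(tokens, b))

def within(text, a, b, k):
    """True if a is followed by b at most k word positions later."""
    tokens = text.lower().split()
    return any(0 < j - i <= k
               for i in positions(tokens, a)
               for j in positions(tokens, b))

text = "forgiveness of sovereign debt remains controversial"
unordered = near(text, "debt", "forgiveness", 10)    # order is ignored
ordered = within(text, "debt", "forgiveness", 10)    # "debt" must come first
```

In the sample sentence "forgiveness" precedes "debt", so the unordered test succeeds while the ordered test with "debt" first does not.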
Contiguous Word Phrases
A Contiguous Word Phrase (CWP) is both a way of specifying a query term and a special search operator.
A Contiguous Word Phrase is two or more words that are treated as a single semantic unit. An example of
a CWP is United States of America. It is four words that specify a search term representing a single
specific semantic concept (a country).
Fuzzy Searches
Fuzzy Searches provide the capability to locate spellings of words that are similar to the entered search
term. This function is primarily used to compensate for errors in spelling of words. Fuzzy searching
increases recall at the expense of decreasing precision (i.e., it can erroneously identify terms as the search
term).
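A common way to implement this kind of matching is edit distance, although the text does not say which similarity measure a given system uses. The sketch below assumes Levenshtein distance and a small invented vocabulary; the threshold max_distance is a hypothetical parameter.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute, or match at no cost
        prev = curr
    return prev[-1]

def fuzzy_match(query, vocabulary, max_distance=1):
    """Return the vocabulary terms within max_distance edits of the query."""
    return [term for term in vocabulary if edit_distance(query, term) <= max_distance]

# Invented vocabulary; "computr" stands in for a misspelled query term.
vocabulary = ["computer", "compute", "commuter", "company"]
matches = fuzzy_match("computr", vocabulary)
```

The misspelled query recovers "computer" (higher recall) but also picks up "compute", which may not be what the user meant (lower precision).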
The Statistical models
Vector space model
In the vector space model, the assumption is made that the stored records (documents) and the
information requests are represented by sets of assigned keywords or index terms. This implies that
queries and documents can be modeled by term vectors of the form
D_i = (a_i1, a_i2, ..., a_it)
Q_j = (q_j1, q_j2, ..., q_jt)
where t is the number of distinct index terms available in the system, and a_ik and q_jk represent the
values of term k in document D_i or query Q_j, respectively.
Typically, a_ik (or q_jk) might be set equal to 1 when term k appears in document D_i (or in query Q_j), and
to 0 if the term is absent from the vector.
Alternatively, the vector coefficients could take on numerical values, the size of each coefficient
depending on the importance of the term in the respective document or query.
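Both kinds of coefficient can be sketched in a few lines. The example assumes whitespace tokenization and uses raw term frequency as the numerical weight; that weighting is only one simple choice made for illustration, and the sample texts and function names are invented.

```python
from collections import Counter

def vocabulary(texts):
    """Sorted list of the t distinct index terms in the collection."""
    return sorted({term for text in texts for term in text.lower().split()})

def binary_vector(text, vocab):
    """Coefficient is 1 if the term appears in the text and 0 otherwise."""
    present = set(text.lower().split())
    return [1 if term in present else 0 for term in vocab]

def weighted_vector(text, vocab):
    """Coefficient is the raw term frequency (an assumed, simple weighting)."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocab]

# Invented sample texts.
documents = ["stock market trading", "trading on the stock market stock"]
vocab = vocabulary(documents)

d1_binary = binary_vector(documents[0], vocab)
d2_weighted = weighted_vector(documents[1], vocab)
query_binary = binary_vector("stock trading", vocab)
```

Documents and queries share the same vocabulary order, so position k in every vector refers to the same term.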
The vector space model is known to be advantageous for a variety of reasons:
a) The similarity between term vectors is easily computed, based on the similarities between the term
assignments to the corresponding vectors. Similarity coefficients can then be generated between queries
and documents for information retrieval, or between different document vectors for document clustering
purposes.

b) When the documents are arranged in decreasing order of query-document similarity, a ranking of the
documents becomes available, and documents can be retrieved in decreasing order of query-document
similarity. A document ranking feature improves the interaction between users and system during the
retrieval process.
c) In the vector system, the document vectors are easily modified either by the addition of new terms and
removal of old terms, or by suitable alterations in the term weights. This vector modification process is
especially useful in query vector
Advantages
The vector space model has the following advantages over the Standard Boolean model:

Simple model based on linear algebra
Term weights not binary
Allows computing a continuous degree of similarity between queries and documents
Allows ranking documents according to their possible relevance
Allows partial matching
need explanation for finding correlation
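A continuous degree of similarity, partial matching, and ranking can be illustrated with the cosine of the angle between the query vector and each document vector. Cosine is a common choice of coefficient but is an assumption here, since the text does not fix a particular measure; the vectors and document labels below are invented.

```python
import math

def cosine(u, v):
    """Cosine similarity: scalar product divided by the product of the vector lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0   # an all-zero vector shares no terms with anything
    return dot / (norm_u * norm_v)

def rank(query_vec, doc_vecs):
    """Return (similarity, doc_id) pairs in decreasing order of similarity."""
    scored = [(cosine(query_vec, vec), doc_id) for doc_id, vec in doc_vecs.items()]
    return sorted(scored, reverse=True)

# Invented term vectors over a five-term vocabulary.
doc_vecs = {
    "d1": [1, 0, 1, 0, 1],
    "d2": [1, 1, 2, 1, 1],
    "d3": [0, 1, 0, 1, 0],
}
query_vec = [0, 0, 1, 0, 1]
ranking = rank(query_vec, doc_vecs)
```

Documents d1 and d2 both contain terms outside the query yet still receive non-zero scores (partial matching), while d3 shares no term with the query and scores zero.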

Limitations
The vector space model has the following limitations:
Long documents are poorly represented because they have poor similarity values (a small scalar
product and a large dimensionality)
Search keywords must precisely match document terms; word substrings might result in a "false
positive match"
Semantic sensitivity; documents with similar context but different term vocabulary won't be
associated, resulting in a "false negative match".
The order in which the terms appear in the document is lost in the vector space representation.
Assumes terms are statistically independent
Weighting is intuitive but not very formal
Many of these difficulties can, however, be overcome by the integration of various tools, including
mathematical techniques such as singular value decomposition and lexical databases such as WordNet.

The probabilistic models


The probabilistic models were first introduced in the early 1960s and represent an attempt to put the
retrieval operations on a sound theoretical basis. The basic premise is that a document should be retrieved
if its probability of relevance to the user's needs exceeds the probability of non-relevance. The
probabilistic approach thus introduces the notion of relevance and non-relevance of a document which is
absent from the vector and Boolean models.
This renders necessary the distinction of term characteristics in the relevant and non-relevant portions of a
collection.
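The retrieval rule and its dependence on term characteristics in the relevant and non-relevant portions can be sketched as follows. This is a simplified illustration, not the formulation of any specific probabilistic model: it assumes that terms occur independently of one another, that each term's occurrence probability in relevant and in non-relevant documents is already known, and that every probability lies strictly between 0 and 1. All numbers and names are hypothetical.

```python
import math

def log_likelihood(doc_terms, term_probs):
    """log P(document | class) under an assumed term-independence model.

    term_probs maps each index term to the probability that it occurs in a
    document of the class; a term absent from the document contributes log(1 - p).
    """
    total = 0.0
    for term, p in term_probs.items():
        total += math.log(p) if term in doc_terms else math.log(1.0 - p)
    return total

def should_retrieve(doc_terms, p_rel, p_nonrel, prior_rel=0.5):
    """Retrieve when the probability of relevance exceeds that of non-relevance."""
    score_rel = log_likelihood(doc_terms, p_rel) + math.log(prior_rel)
    score_non = log_likelihood(doc_terms, p_nonrel) + math.log(1.0 - prior_rel)
    return score_rel > score_non

# Hypothetical term occurrence probabilities, not estimated from real data.
p_rel = {"retrieval": 0.8, "boolean": 0.6, "cooking": 0.1}
p_nonrel = {"retrieval": 0.2, "boolean": 0.3, "cooking": 0.5}

decision_a = should_retrieve({"retrieval", "boolean"}, p_rel, p_nonrel)
decision_b = should_retrieve({"cooking"}, p_rel, p_nonrel)
```

Comparing the two log scores is equivalent to comparing the two posterior probabilities, because the shared normalizing term cancels. In practice the hard part is estimating the two probability tables, which is exactly the distinction between relevant and non-relevant portions described above.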

The main attraction of the probabilistic models is that in principle a large number of phenomena about
terms and their occurrence characteristics may be taken into account, including for example term
co-occurrences for any subset of terms; term relationship indications derived, for example, from existing
semantic nets or other constructs used in artificial intelligence approaches; historical knowledge
about how well certain terms may have done previously in retrieving relevant information in response
to similar information needs; information about term meaning and term relationships derived from
dictionaries and thesauruses; and any prior knowledge about the occurrence distribution of terms in
certain parts of the collection. Because the probabilistic model can accommodate all this intelligence
about documents and queries, it offers the promise of vastly greater effectiveness than the basic vector
and Boolean models.
Weaknesses
a) Because the probabilistic system is not based on existing initial query formulations, the
opportunity of independent weighting of query and document terms that exists in the vector
system is lost in the probabilistic environment.
b) In the normal relevance feedback approach, the initial query terms are considered to be
crucially important. Since initial query terms are not available in the probabilistic system, a
probabilistic relevance feedback operation may produce inferior results.
c) The probabilistic approach can incorporate unspecified term dependencies; no distinction is
made, however, between different types of dependencies of the kind implicitly specified in
the Boolean model (where term synonyms are expressed by or-operators, and term phrases
by and-operators). In practice, a completely parallel treatment of very different classes of
term dependencies may not produce useful retrieval results.
d) Some objective measurements that are routinely used in a vector system, such as the number of
terms attached to a document, or the sum of the weights of the document terms, are excluded
from the existing probabilistic approaches.