22.11.2015, 14.34
Wikipendium
1 - Introduction
1.1 - Information retrieval
Information retrieval deals with the representation, storage, organization of, and access to information items
such as documents, Web pages, online catalogs, structured and semi-structured records, and multimedia
objects. The representation and organization of the information items should be such as to provide the users
with easy access to information of their interest.
IR includes modeling, Web search, text classification, systems architecture, user interfaces, data visualization,
filtering, and languages.
IR consists mainly of building efficient indexes, processing user queries with high performance, and
developing ranking algorithms that improve the results.
In one form or another, indexes are at the core of every modern information retrieval system.
The Web has become a universal repository of human knowledge and culture. Users have created billions of
documents, and finding useful information is not an easy task without running a search, and search is all about
IR and its technologies.
The difference between data retrieval and information retrieval:

| | Data retrieval | Information retrieval |
| --- | --- | --- |
| Content | Data | Information |
| Data object | Table | Document |
| Matching | Exact | Partial |
| Items wanted | Matching | Relevant |
| Query language | Artificial | Natural |
| Query specification | Complete | Incomplete |
| Model | Deterministic | Probabilistic |
| Structure | High | Less |
2.1 - Introduction
The role of the search user interface is to aid in the searcher's understanding and expression of their
information needs, and to help users formulate their queries, select among available information sources,
understand search results, and keep track of the progress of their search.
Boolean query syntax has been shown time and again to be difficult for most users to understand, and those
who try to use it often do so incorrectly.
On Web search engines today, conjunctive ranking is the norm, where only documents containing all query
terms are displayed in the results. However, Web search engines have become more sophisticated about
dropping terms that would result in empty hits, while matching the important terms, ranking the hits higher
that have these terms in close proximity to one another, and using the other aspects of ranking that have been
found to work well for Web search.
In design, studies suggest a relationship between query length and the width of the entry form; small forms
discourage long queries and wide forms encourage longer queries.
Auto-complete (or auto-suggest or dynamic query suggestions) which is shown in real time has greatly
improved query specification.
In some Web search engines the query is run immediately; in others the user must hit the Return key or click
the Search button.
When displaying the search results, the document surrogate refers to the information that summarizes the
document. The text summary containing text extracted from the document is also critical for assessment of
retrieval results. Several studies have shown that longer results are deemed better than shorter ones for certain
types of information needs. Users engaged in known-item searching tend to prefer short surrogates that clearly
indicate the desired information.
One of the most important query reformulation techniques consists of showing terms related to the query or to
the documents retrieved in response to the query. Usually only one suggested alternative is shown; clicking on
that alternative re-executes the query. Search engines are increasingly employing related term suggestions,
referred to as term expansion. Relevance feedback has been shown in non-interactive or artificial settings to be
able to greatly improve rank ordering, but this method has not been successful from a usability perspective.
In organizing search results, a category system is a set of meaningful labels organized in such a way as to
reflect the concepts relevant to a domain. Most commonly used category structures are flat, hierarchical, and
faceted categories. Flat categories are simply lists of topics or subjects. Hierarchical organization online is most
commonly seen in desktop file system browsers. It is, however, difficult to maintain a strict hierarchy when the
information collection is large. Faceted metadata has become the primary way to organize Web site content
and search results. Faceted metadata consists of a set of categories (flat or hierarchical).
Usability studies find that users like, and are successful with, faceted navigation when the interface is designed
properly, and that faceted interfaces are overwhelmingly preferred for collection search and browsing as an
alternative to the standard keyword-and-results listing interface.
Clustering refers to the grouping of items according to some measure of similarity. The greatest advantage is
that it is fully automatic and can be easily applied to any text collection. The disadvantages include an
unpredictability in the form and quality of results.
One drawback of faceted interfaces versus clusters is that the categories of interest must be known in
advance. The largest drawback is the fact that in most cases the category hierarchies are built by hand.
Information visualization has become a common presence in news reporting and financial analysis, but
visualization of inherently abstract information is more difficult, and visualization of textually represented
information is especially challenging.
When using boolean syntax, Hertzum and Frokjaer found that a simple Venn diagram representation produced
more accurate results.
One of the best known experimental visualizations is the TileBars interface, in which documents are shown as
horizontal glyphs with the locations of the query term hits marked along the glyph. Variations of the TileBars
display have been proposed, including a simplified version which shows only one square per query term and
uses color gradation to show query term frequency. Other approaches to showing query term hits within
document collections include placing the query terms in bar charts, scatter plots, and tables. In a usability
study that compared five such views to the Web-style listing, there were no significant differences in task
effectiveness, except for bar charts, which were significantly worse, and all conditions had significantly
higher mean task times than the Web-style listing.
Evaluations of such visualizations conducted so far provide negative evidence as to their usefulness: outside of
text mining, most users of search systems are not interested in seeing how words are distributed across
documents.
3 - Modeling
3.1 - IR Models
Modeling in IR is a process aimed at producing a ranking function that assigns scores to documents for a given
query.
https://www.wikipendium.no/TDT4117_Information_retrieval
tf_{i,j} = 1 + log f_{i,j} if f_{i,j} > 0, and 0 otherwise, where f_{i,j} is the raw frequency of term k_i in document d_j.
We want high scores for frequent terms, but we want even higher score for a rare, descriptive term. These
terms introduce a good discrimination value, and their score is captured in the Inverse Document Frequency:
idf_i = log(N / n_i)

Where N is the total number of documents in the collection and n_i is the number of documents which contain
keyword k_i.
If we combine these two scores we get the best known weighting scheme in IR: the TF-IDF weighting scheme.
This measure increases with the number of occurrences within a document and with the rarity of the terms in
the collection.
w_{i,j} = (1 + log f_{i,j}) · log(N / n_i) if f_{i,j} > 0, and 0 otherwise
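As a concrete illustration, the TF-IDF weighting can be sketched in a few lines of Python (the log base is an assumption; the notes use log without naming a base):

```python
import math

def tfidf_weight(f_ij, N, n_i):
    """TF-IDF weight w_ij of term i in document j.

    f_ij: raw frequency of term i in document j
    N:    total number of documents in the collection
    n_i:  number of documents containing term i
    """
    if f_ij == 0:
        return 0.0
    tf = 1 + math.log2(f_ij)   # term frequency component
    idf = math.log2(N / n_i)   # inverse document frequency component
    return tf * idf
```

Raising either the in-document frequency f_ij or the rarity of the term (lower n_i) raises the weight, matching the discrimination-value intuition above.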
https://www.wikipendium.no/TDT4117_Information_retrieval
In the vector model, the similarity between a document d_j and a query q is the cosine of the angle between their weight vectors:

sim(d_j, q) = (d_j · q) / (|d_j| · |q|)
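The cosine similarity can be sketched over plain list vectors, with no external libraries:

```python
import math

def cosine_sim(d, q):
    """Cosine of the angle between document vector d and query vector q."""
    dot = sum(dw * qw for dw, qw in zip(d, q))
    norm_d = math.sqrt(sum(w * w for w in d))
    norm_q = math.sqrt(sum(w * w for w in q))
    if norm_d == 0 or norm_q == 0:
        return 0.0  # an empty vector matches nothing
    return dot / (norm_d * norm_q)
```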
Probabilistic model
There exists a subset of the document collection that is relevant to a given query. A probabilistic retrieval
model ranks this set by the probability that the document is relevant to the query. The advantage of this model
is that documents are ranked in decreasing order of their probability of being relevant. The disadvantage is the
need to guess the initial separation of documents into relevant and non-relevant sets.
Ranking in the Probabilistic model: the similarity function uses the Robertson-Sparck Jones equation:

sim(d_j, q) ∝ Σ_{k_i ∈ q ∧ k_i ∈ d_j} log( (N − n_i + 0.5) / (n_i + 0.5) )
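With no relevance information, the ranking can be sketched as summing this weight over the query terms that occur in the document (the function and variable names here are my own, not from the notes):

```python
import math

def rsj_score(doc_terms, query_terms, N, df):
    """Sum the RSJ weight log((N - n_i + 0.5) / (n_i + 0.5)) over
    query terms k_i present in the document.

    N:  number of documents in the collection
    df: dict mapping a term to n_i, its document frequency
    """
    score = 0.0
    for term in query_terms:
        if term in doc_terms:
            n_i = df[term]
            score += math.log((N - n_i + 0.5) / (n_i + 0.5))
    return score
```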
Language model
Based on finite state automata...
4 Retrieval Evaluation
4.3 Retrieval Metrics
Precision & Recall
Precision is the fraction of retrieved documents that are relevant. Precision = P(relevant|retrieved)
Recall is the fraction of relevant documents that are retrieved. Recall = P(retrieved|relevant)
Precision and recall are defined on unranked sets. When dealing with ranked lists you compute both
measures at each rank in the returned list, from the top down (top 1, top 2, etc.). This gives you the
precision-recall curve. The interpolation of this result simply sets each precision value to the maximum
precision at any higher recall level. This removes the "jiggles" in the plot, and is necessary to compute the
precision at recall level 0. A recall level is a given fraction of the relevant documents retrieved; the standard
plot uses the 11 levels 0%, 10%, ..., 100%.
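The interpolated precision-recall computation can be sketched at the standard 11 recall levels:

```python
def interpolated_precision(rel_flags, num_relevant):
    """Interpolated precision at recall levels 0.0, 0.1, ..., 1.0.

    rel_flags: ranked list of booleans, True where that document is relevant.
    num_relevant: total number of relevant documents in the collection.
    """
    points = []  # (recall, precision) after each rank
    hits = 0
    for rank, rel in enumerate(rel_flags, start=1):
        if rel:
            hits += 1
        points.append((hits / num_relevant, hits / rank))
    interpolated = []
    for level in (i / 10 for i in range(11)):
        # interpolated precision = maximum precision at any recall >= level
        candidates = [p for r, p in points if r >= level]
        interpolated.append(max(candidates) if candidates else 0.0)
    return interpolated
```

For a ranking (relevant, non-relevant, relevant) with two relevant documents in total, the curve starts at precision 1.0 and ends at 2/3, with the "jiggle" at rank 2 removed.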
F(j) = 2 / (1/r(j) + 1/p(j)) = 2·r(j)·p(j) / (r(j) + p(j))

Where r(j) and p(j) are the recall and precision of the j-th document in the ranking. (The general F-measure
involves the parameters α and β; with α = 0.5, equivalently β = 1, it reduces to the harmonic mean shown
above.) It is 0 when no relevant documents are retrieved, and 1 when all ranked documents are relevant. A
high value of F signals a good compromise between precision and recall.
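The harmonic mean in code form:

```python
def f_measure(r, p):
    """Harmonic mean (F1) of recall r and precision p."""
    if r + p == 0:
        return 0.0
    return 2 * r * p / (r + p)
```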
Rocchio Algorithm

q_m = α·q + (β / |D_r|) · Σ_{d_j ∈ D_r} d_j − (γ / |D_n|) · Σ_{d_j ∈ D_n} d_j

Ide Regular

q_m = α·q + β · Σ_{d_j ∈ D_r} d_j − γ · Σ_{d_j ∈ D_n} d_j

Ide dec hi

q_m = α·q + β · Σ_{d_j ∈ D_r} d_j − γ · max_rank(D_n)

Where q_m is the modified query, d_j is the weighted term vector associated with document d_j, D_r is the set of
relevant documents retrieved, D_n is the set of non-relevant documents retrieved, max_rank(D_n) is the
highest-ranked non-relevant document, and α, β and γ are constants. (The absolute value bars denote the
number of documents in the set.)
The basic idea behind all three is to reformulate the query such that it moves closer to the neighborhood of the
relevant documents in the vector space and away from the neighborhood of the non-relevant documents. The
difference between the methods is that Rocchio normalizes by the number of relevant and non-relevant
documents, while Ide Regular does not, and Ide dec hi uses only the highest-ranked non-relevant document
instead of the sum of all of them. They all yield similar results and improved performance (precision and
recall). They all use both query expansion and term re-weighting, and they are simple, because they compute
the modified term weights directly from the set of retrieved documents.
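Rocchio can be sketched over plain list vectors; the default constants α, β, γ below are common textbook choices, not values from these notes:

```python
def rocchio(q, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Rocchio query modification.

    q: original query weight vector (list of floats)
    relevant / nonrelevant: lists of document weight vectors
    """
    qm = [alpha * w for w in q]
    for d in relevant:
        for i, w in enumerate(d):
            qm[i] += beta * w / len(relevant)      # pull toward relevant docs
    for d in nonrelevant:
        for i, w in enumerate(d):
            qm[i] -= gamma * w / len(nonrelevant)  # push away from non-relevant
    return qm
```

Dropping the normalization by |D_r| and |D_n| gives Ide Regular; passing only the highest-ranked non-relevant document as `nonrelevant` gives Ide dec hi.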
Probabilistic model
The Relevance Feedback procedure for the Probabilistic model uses statistics found in retrieved documents.
P(k_i | R) = |D_{r,i}| / |D_r|   and   P(k_i | R̄) = (n_i − |D_{r,i}|) / (N − |D_r|)

Where D_r is the set of relevant and retrieved documents, D_{r,i} is the subset of relevant and retrieved
documents that contain k_i, N is the total number of documents, and n_i is the number of documents
containing k_i.
The main advantage of this relevance feedback procedure is that the feedback process is directly related to the
derivation of new weights for the query terms. The disadvantages are that it uses only term re-weighting (no
query expansion), that the document term weights are not incorporated, and that the initial query term
weights are disregarded.
Local Analysis
Clusters
Clusters offer ways of deriving a synonymy relationship between two local terms by building association
matrices that quantify term correlations.
Association clusters are based on the frequency of co-occurrence of terms inside documents; they do not take
into account where in the document the terms occur.
Metric clusters are based on the idea that two terms occurring in the same sentence tend to be more correlated,
and factor the distance between the terms in the computation of their correlation factor.
Scalar clusters are based on the idea that two terms with similar neighborhoods have some synonymy
relationship, and uses this to compute the correlations.
Global Analysis
Determine term similarity through a pre-computed statistical analysis on the complete collection. Expand
queries with statistically most similar terms. Two methods: Similarity Thesaurus and Statistical Thesaurus. (A
thesaurus provides information on synonyms and semantically related words and phrases.) Increases recall,
may reduce precision.
6 Text Operations
6.6 Document Preprocessing
There are five text transformations (operations) used to prepare a document for indexing:
Lexical analysis
Stopword elimination
Stemming
Index term selection
Thesauri
Lexical analysis
Numbers are of little value alone and should be removed. Combinations of numbers and words could be kept.
Hyphens (dashes) between words should be replaced with whitespace. Punctuation marks are usually removed,
except when dealing with program code etc. To deal with the case of letters, convert all text to either upper or
lower case.
Stopword elimination
Stopwords are words occurring in over 80% of the documents in the collection. These are not good candidates
for index terms and should be removed before indexing (this can reduce the index size by 40%). They often
include articles, prepositions, and conjunctions, as well as some verbs, adverbs, and adjectives. Lists of
popular stopwords can be found on the Internet.
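A minimal stopword filter (the word list below is illustrative, not a list from the course):

```python
STOPWORDS = {"the", "a", "an", "of", "in", "on", "and", "or", "to", "is"}

def remove_stopwords(tokens):
    """Drop stopwords (case-insensitively) before indexing."""
    return [t for t in tokens if t.lower() not in STOPWORDS]
```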
Stemming
The idea of stemming is to let any conjugation or plural form of a word produce the same result in a query. This
improves retrieval performance and reduces the index size. There are four types of stemming: affix removal,
table lookup, successor variety, and n-grams. Affix removal is the most intuitive, simple, and effective, and is the
focus of the course, with use of the Porter algorithm.
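The flavor of affix removal can be sketched with a toy suffix stripper; this is a crude sketch only, nowhere near the full Porter algorithm (which applies staged rules with measure conditions), and the suffix list is illustrative:

```python
# Strip the longest matching suffix, keeping a stem of at least three letters.
SUFFIXES = sorted(["ing", "ed", "ly", "ies", "es", "s", "ation"], key=len, reverse=True)

def crude_stem(word):
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word
```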
Thesauri
To assist a user in choosing proper query terms you can construct a thesaurus. It provides a hierarchy that
allows the broadening and narrowing of a query. It is built by creating a list of important words in a given
domain and, for each word, providing a set of related words derived from synonymy relationships.
Block addressing can be used to reduce space requirements. This is done by dividing text into blocks and let
the occurrences point to the blocks.
Searching an inverted index is done in three steps:
1. Vocabulary search - words in queries are isolated and searched separately.
2. Retrieval of occurrences - retrieving occurrence of all the words.
3. Manipulation of occurrences - occurrences processed to solve phrases, proximity or Boolean operations.
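The three steps can be sketched for a conjunctive (AND) query over a toy inverted index:

```python
from collections import defaultdict

def build_index(docs):
    """Inverted index: term -> list of ids of documents containing it."""
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        for term in sorted(set(text.lower().split())):
            index[term].append(doc_id)
    return index

def boolean_and(index, query):
    """Step 1: isolate and look up each query word.
    Steps 2-3: retrieve the occurrence lists and intersect them."""
    postings = [set(index.get(term, [])) for term in query.lower().split()]
    return set.intersection(*postings) if postings else set()
```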
Ranked retrieval: when dealing with weight-sorted inverted lists we want the best results. Sequentially searching
through all the documents is time consuming; we just want the top-k documents. This is trivial with a single-word
query: the list is already sorted and you return the first k documents. For other queries we need to merge
the lists (see Persin's algorithm).
When constructing a full-text inverted index there are two sets of algorithms and methods: internal
algorithms and external algorithms. The difference is whether or not we can store the text and the index in
internal (main) memory. The former is relatively simple and low-cost, while the latter needs to write partial
indexes to disk and then merge them into one index file.
In general, there are three different ways to maintain an inverted index:
Rebuild: simple on small texts.
Incremental updates: done during searching, only when needed.
Intermittent merge: new documents are indexed and the new index is merged with the existing one. This is,
in general, the best solution.
Inverted indexes can be compressed in the same way as documents (chapter 6.8). Some popular coding
schemes are: unary, Elias-γ, Elias-δ, and Golomb.
Heaps' law estimates the number of distinct words in a document or collection, predicting the growth of the
vocabulary size: V = K·n^β, where n is the size of the document or collection (number of words), and K and β
are constants with 0 < β < 1.
Zipf's law estimates the distribution of words across documents in the collection (an approximate model). It
states that if t_1 is the most common word in the collection, t_2 the next most common, and so on, then the
frequency f_i of the i-th most common word is proportional to 1/i. That is, f_i = c/i, where c is a constant.
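Both laws in code form (the constants K, β, and c below are illustrative, not fitted values):

```python
def heaps_vocabulary(n, K=10.0, beta=0.5):
    """Heaps' law: predicted vocabulary size V = K * n^beta."""
    return K * n ** beta

def zipf_frequency(i, c=1000.0):
    """Zipf's law: frequency of the i-th most common word, f_i = c / i."""
    return c / i
```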
Suffix structures are useful for languages where word-based indexing is hard, such as languages without clear
word separators (Chinese, Japanese, Korean) and agglutinating languages (Finnish, German). A suffix trie is an
ordered tree data structure built over the suffixes of a string. A suffix tree is a compressed trie, and a suffix
array is a flattened tree. These structures handle the whole text as a single string, dividing it into suffixes,
either by character or by word. Each suffix runs from its start point to the end of the text, making them smaller
and smaller, e.g.:
mississippi (1)
ississippi (2)
...
pi (10)
i (11)
These structures make it easier to search for substrings, but they have large space requirements: a tree takes
up to 20 times the space of the text, and an array about 4 times the text size.
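A suffix array sketch over the example above; the naive construction below simply sorts all suffixes, which is easy to read but far from the efficient construction algorithms used in practice:

```python
def suffix_array(text):
    """Starting positions of all suffixes of text, in sorted order."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def contains(text, sa, pattern):
    """Binary-search the suffix array for a suffix starting with pattern."""
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:] < pattern:
            lo = mid + 1
        else:
            hi = mid
    return lo < len(sa) and text[sa[lo]:].startswith(pattern)
```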
11 Web Retrieval
Link-based ranking
HITS, or Hyperlink-Induced Topic Search, divides pages into two sets: authorities and hubs. If page i contains
valuable information on a subject, it is called an authority. If it contains many links to relevant documents
(authorities), it is called a hub.
Pros
Cons
PageRank differs from HITS because it produces a ranking independent of the user's query. The concept is that
a page is considered important if it is pointed to by other important pages. Meaning: the PageRank score for a
page is determined by summing the PageRanks of all pages that point to it.
Pros
Cons
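The summing idea can be sketched as a short power iteration; the damping factor d and the iteration count are conventional choices, not values from the notes, and dangling pages (no outgoing links) are assumed away:

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterative PageRank over a dict mapping page -> list of outgoing links.

    Assumes every page has at least one outgoing link.
    """
    pages = list(links)
    N = len(pages)
    rank = {p: 1.0 / N for p in pages}
    for _ in range(iterations):
        new_rank = {}
        for p in pages:
            # sum the rank flowing in from every page q that links to p
            incoming = sum(rank[q] / len(links[q]) for q in pages if p in links[q])
            new_rank[p] = (1 - d) / N + d * incoming
        rank = new_rank
    return rank
```

In a three-page graph where both b and c point at a, page a ends up with the highest score and the unlinked-to page c with the lowest.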
Spam techniques
Keyword stuffing and hidden text involve misleading meta tags and excessive repetition of keywords hidden
from the user with colors, stylesheet tricks, etc., left for the web crawlers to find. Most search engines catch
these nowadays.
A doorway page is a page optimized for a single keyword that redirects to the real target page.
Landing pages are optimized for a single keyword, or a misspelled domain name, and are designed to attract
surfers who will then click on ads.
Cloaking involves serving two distinct users different content. It is used to serve fake content to the search
engine crawler and spam to real users.
Link spam is about creating lots of links pointing to the page you want to promote, and put these links on
pages with high PageRank.
Search Engine Optimization (SEO) is a fine balance between spam and legitimate methods. It mostly involves
buying recommendations or working hard for promotion.
Indexing is performed over the features extracted from the multimedia objects.
In Content-Based Image Retrieval the task is to find images similar to the one the user queries with (Query By
Example). This method ignores semantics and uses features like color, texture, and shape. Color-based
retrieval can represent the image with a color histogram; this is independent of e.g. resolution, focal point,
and pose, and the retrieval process compares the histograms. Texture-based retrieval extracts the repetitive
elements in the image and uses them as a feature.
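Color-based retrieval can be sketched as: quantize pixels into a histogram, then compare histograms (here with histogram intersection; the bin count is an illustrative choice):

```python
def color_histogram(pixels, bins=4):
    """Histogram over quantized (r, g, b) pixels, normalized to sum to 1."""
    counts = {}
    for r, g, b in pixels:
        key = (r * bins // 256, g * bins // 256, b * bins // 256)
        counts[key] = counts.get(key, 0) + 1
    total = len(pixels)
    return {k: v / total for k, v in counts.items()}

def histogram_intersection(h1, h2):
    """Similarity in [0, 1]: sum of bin-wise minima of two normalized histograms."""
    return sum(min(h1.get(k, 0), h2.get(k, 0)) for k in set(h1) | set(h2))
```

Because the histogram discards pixel positions, two images of the same scene at different resolutions or poses map to similar histograms, which is exactly the invariance noted above.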
Written by
Stian Jensen, cristea, nina, anna, boyebn, thormartin91, viktorfa, tmp, fiLLLip