Sie sind auf Seite 1von 87

CP7024 INFORMATION RETRIEVAL

TECHNIQUES

UNIT I - INTRODUCTION
UNIT II - MODELING
UNIT III INDEXING
UNIT IV - CLASSIFICATION AND CLUSTERING
UNIT V - SEARCHING AND RANKING
UNIT I - INTRODUCTION
Motivation
Basic Concepts
Practical Issues
Retrieval Process
Architecture
Boolean Retrieval
Retrieval Evaluation
Open Source IR Systems
History of Web Search
Web Characteristics
The impact of the web on IR
IR Versus Web Search
Components of a Search engine
Motivation- Information Retrieval

The process of actively seeking out information relevant


to a topic of interest.(van Rijsbergen)
Information retrieval (IR) deals with the representation,
storage, organization of, and access to information items.
The representation and organization of the information
items should provide the user with easy access to the
information in which he is interested.
Motivation
Motivation. Information retrieval (IR) deals
with the representation, storage, organization
of, and access to information items. ... Given
the user query, the key goal of an IR system is
to retrieve information which might be useful
or relevant to the user.
The last two decades have seen an enormous in- crease in
the amount of information available, in the form of text
documents as well as multimedia data such as images,
speech and video.
As a result, information retrieval (IR) has become a central
topic of computer science and related dis-ciplines and is
now part of many curricula for bachelor and master
programs.
information search was a task mostly executed by trained and
dedicated search professionals (librarians, database
administrators), nowadays, professionals, semi-professionals,
and hurried end-users share the same goal:
To find relevant information quickly. Therefore, information
retrieval (IR) is now part of various curricula for bachelor and
master pro-grams.
These programs range from library science over information
science to computer science;
Difference Between data retrieval and
information retrieval
Definition

Information Retrieval (IR) is finding material


(usually documents) of an unstructured nature
(usually text) that satisfies an information need
from within large collections (usually stored on
computers).
Basic ideas
Information overload
The challenging byproduct of the information age
Huge amounts of information available -- how to find
what you need when you need it
Think about addresses, e-mail messages, files of interesting
articles, etc.
Information retrieval is the formal study of
efficient and effective ways to extract the right bit
of information from a collection.
The web is a special case, as we will discuss.
The stages of IR

Creation Information Indexing,


organizing
Indexed Retrieval
and structured Searching
information Browsing
Types of IR

four distinct types of information retrieval systems


OPACs
online databases
digital libraries and web-based information services
web search engines
Role of the user interface in IR
INPUT
Problem definition

Source selection

Problem articulation

Engine
OUTPUT
Examination of results

Extraction of information

Integration with overall task


Applications
Text Search
Ad search
Image/Video search
Email Search
Question Answering
systems
Recommender
systems
Desktop Search
Expert Finding
16
Short history of IR modelling

Boolean model (1950)


Document similarity (1957)
Vector space model (1970)
Probabilistic retrieval(1976)
Language models (1998)
Linkage-based models (1998)
Positional models (2004)
Fielded models (2005)
17
Practical Issues
Vocabularies mismatching
Synonymy: e.g. car v.s. automobile
Polysemy: table
Queries are ambiguous, they are partial specification of users
need
Content representation may be inadequate and incomplete
The user is the ultimate judge, but we dont know how the
judge judges
The notion of relevance is imprecise, context- and user-dependent

18
Practical Issues(Conti..)
prompt dissemination of information
filtering of information
providing the right amount of information at the right time
active switching of information
receiving information in the desired form
browsing
getting information in an economical way
current literature
providing access to other information systems
interpersonal communication
offering personalized help.
The formalized IR process
Real world Anomalous state of knowledge

Collection of documents Information need

Document representations Query

Matching

Results
Architecture of IR
Supporting the search process
Elements of an information retrieval
system
Functions
to identify the information (sources) relevant to the areas of interest of the target users
community; this is a challenging job especially in the web environment where virtually
everybody in the world can be the potential user of a web-based information retrieval
system
to analyze the contents of the sources (documents); this is becoming increasingly
challenging as the size, volume and variety of information sources (documents) is
increasing rapidly; web information retrieval is carried out automatically using specially
designed programs called spiders
to represent the contents of analyzed sources in a way that matches users queries; this is
done by automatically creating one or more index files, and is becoming an increasingly
complex task due to the volume and variety of content and increasing user demands
to analyze users queries and represent them in a form that will be suitable for matching
the database; this is done in a number of ways, through the design of sophisticated search
interfaces including those that can provide some help to users for selection of appropriate
search terms by using dictionary and thesauri, automatic spell checkers, a predefined set of
search statements and so forth
to match the search statement with the stored database; a number of complex information
retrieval models have been developed over the years that are used to determine the
similarity of the query and stored documents
to retrieve relevant information; a variety of tools and techniques are used to determine
the relevance of retrieved items and their ranking
to make continuous changes in all aspects of the system, keeping in mind the rapid
developments in information and communication technologies (ICTs) relating to changing
patterns of society, users and their information needs and expectations.
Making the connections
Stemming
Making sure that simple variations in word form are recognized
as equivalent for the purpose of the search: exercise, exercises,
exercised, for example.
Indexing
A keyword or group of selected words
Any word (more general)
How to choose the most relevant terms to use as index elements
for a set of documents.
Build an inverted file for the chosen index terms.
Anatomy of a web page

Metatags: Information about the page


Primary source of indexing information for a search engine.
Ex. Title. Never mind what has an H1 tag (though that may be
considered), what is in the <title> </title> brackets?
Other tags provide information about the page. This is easier for
the search engine to use than determining the meaning of the text
of the page.
Dealing with the cheaters
False information provided in the web page to make the search
engine return this page
False metatags, invisible words (repeated many times), etc
Anatomy of a web page(conti..)
Anatomy of a web page(conti..)
Components
an information retrieval system comprises six major
subsystems the document subsystem
the indexing subsystem
the vocabulary subsystem
the searching subsystem
the user-system interface
the matching subsystem
Boolean Model
Boolean Model is one of the oldest and simplest models of
Information Retrieval.
It is based on set theory and Boolean algebra.A document is
represented as a set of keywords.
Queries are Boolean expressions of keywords, connected by
AND, OR, and NOT, including the use of brackets to
indicate scope.
[[Rio & Brazil] | [Hilo & Hawaii]] & hotel & !Hilton]
Output: Document is relevant or not. No partial matches or
ranking.
In this model, each document is taken as a bag of index
terms.
Index terms are simply words or phrases from the document
that are important to establish the meaning of the document

31
Boolean Retrieval Model
The query is a Boolean algebra expression using connectives like ,, etc.
The documents retrieved are the documents that completely match the given
query.
Partial matches are not retrieved. Also, the retrieved set of documents is not
ordered.For example, Say, there are four documents in the system.
For each term in the query, a list of documents that contain the term is created.
Then the lists are merged according to the Boolean operators.

32
Boolean Model
Advantages
It is simple, efficient and easy to implement.
It was one of the earliest retrieval methods to be implemented. It
remained the primary retrieval model for at least three decades.
It is very precise in nature. The user exactly gets what is specified.
Boolean model is still widely used in small scale searches like
searching emails, files from local hard drives or in a mid-sized library.
Disadvantages
In Boolean model, the retrieval strategy is based on binary criteria.
So, partial matches are not retrieved. Only those documents that exactly
match the query are retrieved.
Hence, to effectively retrieve from a large set of documents users must
have a good domain knowledge to form good queries.
The retrieved documents are not ranked.

Boolean Models Problems


Very rigid: AND means all; OR means any.
Difficult to express complex user requests.
Difficult to control the number of documents
retrieved.
All matched documents will be returned.
Difficult to rank output.
All matched documents logically satisfy the query.
Difficult to perform relevance feedback.
If a document is identified by the user as relevant or
irrelevant, how should the query be modified?

35
Retrieval Evaluation
Contingency table of classification of documents
Actual Condition
Present Absent

fp fp type 1 error
Positive tp
type1
Test result
Negative fn fn type 2 error
tn
type2
present = tp + fn
positives = tp + fp
Total # of cases N = tp + fp + fn + tn negatives = fn + tn

False positive rate = fp/(negatives)


False negative rate = fn/(positives)
Example
Documents available:
D1,D2,D3,D4,D5,D6,
D7,D8,D9,D10 relevant not relevant
Relevant: D1, D4, D5,
D8, D10 retrieved D4,D5,D8 D2,D6,D9

Query to search
engine retrieves: D2, not retrieved D1,D10 D3,D7

D4, D5, D6, D8, D9


Relevant vs. Retrieved

All docs
Retrieved

Relevant
Precision vs. Recall
| RelRetrieved | | RelRetrieved |
Precision = Recall =
| Retrieved | | Rel in Collection |

All docs
Retrieved

Relevant
Evaluation of Matching:
Recall and Precision
If information retrieval were perfect ...
Every hit would be relevant to the original query, and every
relevant item in the body of information would be found.

Precision: percentage (or fraction) of the hits that are


relevant, i.e., the extent to which the set of hits
retrieved by a query satisfies the requirement that
generated the query.

Recall: percentage (or fraction) of the relevant items that are


found by the query, i.e., the extent to which the query
found all the items that satisfy the requirement.
Recall and Precision with Exact
Matching: Example

Collection of 10,000 documents, 50 on a specific topic


Ideal search finds these 50 documents and reject all others
Actual search identifies 25 documents; 20 are relevant but 5
were on other topics
Precision: 20/ 25 = 0.8 (80% of hits were relevant)
Recall: 20/50 = 0.4 (40% of relevant were found)
Measuring Precision and Recall
Precision is easy to measure:
A knowledgeable person looks at each document that is
identified and decides whether it is relevant.
In the example, only the 25 documents that are found need
to be examined.
Recall is difficult to measure:
To know all relevant items, a knowledgeable person must
go through the entire collection, looking at every object to
decide if it fits the criteria.
In the example, all 10,000 documents must be examined.
Open Source IR Systems

Google Search API


Lucene
Terrier
Lemur
Dragon
Groogle
Types of search engines
Q&A engines
Collaborative
Enterprise
Web
Metasearch
Semantic
NLP
...
52
Google Search API: AJAX
Web Search
Incorporate results from Web Search, News
Search, and Blog Search

Local Search
Provides access to local search results from
Google Maps.

Video Search
Incorporate a simple search box
incorporate dynamic, search powered strips of
video and book thumbnails.

53
Lucene
Cross-Platform API
Implemented in Java
Ported in C++, C#, Perl, Python
Offers scalable, high-performance indexing
Incremental indexing as fast as bath indexing
Index size roughly 20-30% the size of indexed text
Supports many powerful query types

54
Lucene: Modules
Analysis
Tokenization, Stop words, Stemming, etc.
Document
Unique ID for each document
Title of document, date modified, content, etc.
Index
Provides access and maintains indexes.
Query Parser
Search / Search Spans

55
Terrier: Overview (1/2)
Stands for TERabyte RetrIEveR.
Open Source API (Mozilla Public Licence).
Modular platform for the rapid development of large-scale IR
applications.
It is written in Java (and Perl)
Highly compressed disk data structures.
Handling large-scale document collections.
Standard evaluation of TREC ad-hoc and known-item search
retrieval results.
Based on a new parameter-free probabilistic framework for IR
(DFR), allowing adaptable term weighting functionalities.

56
Terrier: Indexing
Create your own Collection decoder and Document implementation.
Centralized or distributed Setting.
Indexer iterates through the collection and creates the following data
structures
Direct Index
Document Index
Lexicon

57
Lemur: Overview
Support for XML and structured document
retrieval
Interactive interfaces for Windows, Linux, and
Web
Cross-Platform, fast and modular code written
in C++
Free and open-source software

58
Lemur: Query Flow

Query
User Query Parser

Scoring
Nodes
runQuery()

Sorted Vector of Score Extent


Results

59
The Dragon Toolkit
University Of Drexel

60
Dragon: Overview (1/2)
Highly scalable to large data set
Well designed Programming API and XML-
based Interface
Various document representations including
words, multiword phrases, ontology-based
concepts, and concept pairs
Various text retrieval models
Text classification, clustering, summarization
and topic modeling
61
Dragon: Overview (2/2)
Provides built-in supports for semantic-based IR and TM
(different from Lucene and Lemur ).
Integrates a set of NLP tools, which enable the toolkit to
index text collections with various representation
schemes including words, phrases, ontology-based
concepts and relationships.
It is specially designed for large-scale application.
The toolkit uses sparse matrix to implement text
representations and does not have to load all data into
memory in the running time.
Can handle hundred thousands of documents with very
limited memory.

62
History of Web Search

History of Web Search


Web Characteristics
The impact of the web on IR
IR Versus Web Search
Components of a Search engine
What are Search Engines?
Designed to assist you in searching through
the enormous amount of information on
the Web
No single search tool has everything
Each engine is a large database which
utilizes different search techniques and
tools (spiders or robots) to build indexes to
the Internet (some also utilize submissions
and administration)

64
Web characteristics
The web graph
view the static Web consisting of static HTML pages
together with the hyperlinks between them as a
directed graph in which each web page is a node and
each hyperlink a directed edge.
Spam
This led to the first generation of spam, which SPAM
(in the context of web search) is the manipulation of
web page content for the purpose of appearing high up
in search results for selected keywords.
To avoid irritating users with these repetitions,
sophisticated spammers re-sorted to such tricks as
rendering these repeated terms in the same color as the
background.
Web crawler

Web crawling is the process by which we gather pages from


the web, in order to index them and support a search engine.
The objective of crawling is to quickly and efficiently
gather as many useful web pages as possible.
Crawling the web
Misnomer as the spider or robot does not actually move about
the web
Program sends a normal request for the page, just as a browser
would.
Retrieve the page and parse it.
Look for anchors -- pointers to other pages.
Put them on a list of URLs to visit
Extract key words (possibly all words) to use as index terms related
to that page
Take the next URL and do it again
Actually, the crawling and processing are parallel activities
Which Search Engine?
Yahoo
Altavista
Excite
Google
NorthernLights
Hotbot
Infoseek
See Handout - The Little Search Engine that Could

70
Which Search Engine?
Yahoo
Altavista
Excite
Google
NorthernLights
Hotbot
Infoseek
See Handout - The Little Search Engine that Could

71
Components of a Search engine

Generally there are three basic components


of a search engine as listed below:
Web Crawler.
Database.
Search Interfaces.
Query Engine Index

Interface

Indexer
Users

Crawler

Characteristics

Web
A Typical Web Search Engine
Whats SEO?
SEO = Search Engine Optimization

Refers to the process of optimizing both the on-page and off-page ranking
factors in order to achieve high search engine rankings for targeted search
terms.

Refers to the industry that revolves around obtaining high rankings in the
search engines for desirable keyword search terms as a means of increasing
the relevant traffic to a given website.

Refers to an individual or company that optimizes websites for its clientele.

Has a number of related meanings, and usually refers to an individual/firm


that focuses on optimizing for organic search engine rankings
Whats SEO based on
Features we know used in web page ranking
Page content
Page metadata
Anchor text
Links
User behavior
Others?
Search Engine Spam: Spamdexing
spamdexing (also known as search spam, search
engine spam, web spam or search engine
poisoning) is the deliberate manipulation of search
engine indexes
aka: black hat SEO
Porn led the way 96
Search Engine Spamdexing Methods

Content based
Link spam
Cloaking
Mirror sites
URL redirection
Duplicate detection

Get rid of duplicates; save space and time


Product or idea evolution (near duplicates)
Check for stolen information, plagiarism,
etc.

Das könnte Ihnen auch gefallen