
Information Retrieval

Lebanese University
Faculty of Economics and Business Administration, 1st Branch
Class: M1
Instructor: Dr. Lina A. Nimri

Course Text Book


Modern Information Retrieval, R. Baeza-Yates and B. Ribeiro-Neto, Addison-Wesley and ACM Press, 1999, ISBN: 0-201-39829-X

Introduction

Modern Information Retrieval, Chapter 1

Ricardo Baeza-Yates, Berthier Ribeiro-Neto


Examples of information needs in the context of the World Wide Web:

Find all documents containing information on computer courses which:
  (1) are offered by universities in South England, and
  (2) are accredited by the BCS/IEE bodies.
To be relevant, a document must include information on admission requirements, and an e-mail address and phone number for contact purposes.

Find all documents containing information on college tennis teams which:
  (1) are maintained by a USA university, and
  (2) participate in the NCAA tournament.

Information Retrieval

Representation, storage, organisation, and access to information items


Figure: Information Retrieval. Documents are given a (usually) keyword-based representation; the user's information need is expressed as a query; the search engine (retrieval system) returns a set of retrieved documents holding information that is useful or relevant to the user.

Primary goal of an IR system
Retrieve all the documents that are relevant to a user query, while retrieving as few non-relevant documents as possible.

Data Retrieval
Determining which documents contain the keywords in the user query is not always enough to satisfy the user's information need.
Data retrieval retrieves objects that satisfy clearly defined conditions, such as regular expressions or relational algebra expressions.
A data retrieval system deals with data that has a well-defined structure and semantics.

Information Retrieval System
Retrieves information about a subject.
Deals with natural language text, which is not well structured and can be semantically ambiguous.
Must interpret the contents of documents and rank them according to their degree of relevance to the user need.

Area of interest
Digital libraries
Information experts
World Wide Web: a very difficult task
  The hyperspace is vast
  There is no well-defined data model (format or representation form)

Effective retrieval
The effective retrieval of relevant information is directly affected by:
  The user task
  The logical view of the documents (document representation) adopted by the retrieval system

User tasks

Pull technology
The user requests information in an interactive manner
Three retrieval tasks:
  Browsing (hypertext)
  Retrieval (classical IR systems)
  Browsing and retrieval (modern digital libraries and web systems)

Push technology
Automatic and permanent pushing of information to the user
Software agents
Example: news service
Filtering (retrieval task): selects relevant information for later inspection by the user

Pulling
The user can browse the documents when his or her main objectives are not clear at the beginning and may change during the interaction with the system. Combining retrieval and browsing is not yet a well-established approach.

Figure: the two user tasks performed against the database, retrieval and browsing.

Documents
Unit of retrieval
A passage of free text
  composed of text: strings of characters from an alphabet
  composed of natural language
  examples: a newspaper article, a journal paper, a dictionary definition, e-mail messages
Size of documents is arbitrary: newspaper article vs. journal paper vs. e-mail

What is a document?


Representation of documents
Documents are represented through a set of index terms or keywords (term descriptors)
  extracted directly from the text, or
  specified by human subjects (information science)
  Most concise representation
  Poor quality of retrieval
Full text representation
  Most complete representation
  High computational cost
Large collections: reduce the set of representative keywords
  Elimination of stop words
  Stemming
  Identification of noun phrases
  Further compression and indexing
Metadata
Document term descriptors are used to access texts
Generation of descriptors for text
  By hand
  By analysing the text
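The keyword-reduction steps listed above (stop-word elimination and stemming) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the stop-word list and the suffix list are hypothetical and tiny, and `naive_stem` is a toy suffix stripper, not the Porter stemmer or any other published algorithm.

```python
import re

# Hypothetical stop-word list, far smaller than any real one.
STOP_WORDS = {"a", "an", "and", "are", "by", "in", "is", "of", "on", "the", "to", "which"}

# Toy suffix list (assumption); a real system would use a proper stemmer.
SUFFIXES = ("ing", "es", "s")


def tokenize(text):
    """Lower-case the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def index_terms(text):
    """Reduce full text to a sorted list of distinct index terms."""
    kept = [t for t in tokenize(text) if t not in STOP_WORDS]
    return sorted({naive_stem(t) for t in kept})


print(index_terms("Universities in South England are offering computer courses"))
# ['computer', 'cours', 'england', 'offer', 'south', 'universiti']
```

The stems need not be dictionary words; they only have to map related word forms to the same index term.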

Logical View of the documents
Figure: text operations (accents and spacing, stopwords, noun groups, stemming, manual indexing) take a document from its full text, with or without structure, to a set of index terms.

The retrieval functions
Figure: the information need is formulated as a query and the documents are indexed into a document representation; the retrieval functions match the two to produce the retrieved documents, with relevance feedback returning to the query.

Queries
Information need: user term descriptors characterising the user need

Simple queries
Composed of two or three, perhaps even dozens, of keywords, e.g. as in web retrieval

Boolean queries
neural networks AND speech recognition

Context queries
Proximity search, phrase queries
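As a rough illustration of Boolean and phrase queries, the sketch below evaluates the slide's example, neural networks AND speech recognition, treating each side as a phrase. The three-document collection, the function names, and the positional index layout are all assumptions made for this example, not part of the course material.

```python
from collections import defaultdict

# Hypothetical three-document collection.
DOCS = {
    1: "neural networks for speech recognition",
    2: "speech recognition with hidden markov models",
    3: "neural networks and image recognition",
}


def build_positional_index(docs):
    """Map each term to {doc_id: [word positions]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index


def phrase_query(index, phrase):
    """Return ids of documents that contain the words of `phrase` consecutively."""
    words = phrase.lower().split()
    if not words or any(w not in index for w in words):
        return set()
    hits = set()
    for doc_id, starts in index[words[0]].items():
        for start in starts:
            # Word i of the phrase must sit at position start + i.
            if all(start + i in index[w].get(doc_id, []) for i, w in enumerate(words)):
                hits.add(doc_id)
                break
    return hits


def boolean_and(index, *phrases):
    """Intersect the result sets of several phrase queries."""
    result = None
    for phrase in phrases:
        docs = phrase_query(index, phrase)
        result = docs if result is None else result & docs
    return result or set()


INDEX = build_positional_index(DOCS)
print(boolean_and(INDEX, "neural networks", "speech recognition"))  # {1}
```

Storing word positions is what makes phrase and proximity search possible; a plain Boolean query would only need the document ids.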

Best-Match retrieval

Compare the terms in a document and the query
Compute the similarity between each document in the collection and the query, based on the terms they have in common
Sort the documents in order of decreasing similarity with the query
The output is a ranked list displayed to the user; the top documents are the most relevant as judged by the system

Document term descriptors (used to access texts) are compared with user term descriptors (characterising the user need).
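A minimal sketch of best-match retrieval follows. The slides do not name a similarity measure, so the cosine between binary term vectors is used here purely as an assumption; the small collection and the function names are likewise hypothetical.

```python
import math

# Hypothetical collection used only for this example.
DOCS = {
    "d1": "college tennis teams in the ncaa tournament",
    "d2": "computer courses offered by universities",
    "d3": "tennis tournament results",
}


def terms(text):
    """Represent a text as the set of its lower-cased words."""
    return set(text.lower().split())


def similarity(doc_terms, query_terms):
    """Cosine similarity between binary term vectors (an assumed measure)."""
    if not doc_terms or not query_terms:
        return 0.0
    shared = len(doc_terms & query_terms)
    return shared / math.sqrt(len(doc_terms) * len(query_terms))


def rank(docs, query):
    """Return ids of documents sharing a term with the query, best match first."""
    q = terms(query)
    scored = [(similarity(terms(text), q), doc_id) for doc_id, text in docs.items()]
    scored = [(score, doc_id) for score, doc_id in scored if score > 0]
    # Sort by decreasing similarity; break ties by document id.
    return [doc_id for score, doc_id in sorted(scored, key=lambda p: (-p[0], p[1]))]


print(rank(DOCS, "tennis tournament"))  # ['d3', 'd1']
```

Both d1 and d3 share two terms with the query, but d3 is shorter, so a larger fraction of it matches and it ranks first.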

Conceptual view of text retrieval system
Figure: queries and documents feed a similarity computation, which outputs the retrieved documents.

Expanded view of text retrieval system
Figure: documents are indexed into indexed documents; queries and indexed documents feed the similarity computation, which yields the retrieved documents and then the ranked documents.

Process of retrieving info
Figure: components of the retrieval process: User Interface (user need, user feedback), Text Operations (logical view), Query Operations (query), Indexing (inverted file), Document Repository Manager, Index, Text repository, Similarity Computation (Searching) producing the retrieved docs, and Ranking producing the ranked docs.
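The inverted file named in the retrieval process can be sketched as a mapping from each term to a sorted list of the documents that contain it. The code below is a simplified illustration: the sample collection and function names are hypothetical, and real systems add compression, term weights and on-disk storage.

```python
from collections import defaultdict

# Hypothetical collection used only for this example.
DOCS = {
    1: "indexing text documents",
    2: "retrieving text documents",
    3: "query reformulation",
}


def build_inverted_file(docs):
    """Map each term to the sorted list of ids of documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def intersect(a, b):
    """Merge two sorted posting lists, keeping ids present in both."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


inverted = build_inverted_file(DOCS)
print(inverted["text"])                                    # [1, 2]
print(intersect(inverted["text"], inverted["documents"]))  # [1, 2]
```

Keeping posting lists sorted lets the searching step intersect them in a single linear pass instead of scanning every document.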

Key Topics
Indexing text documents
Retrieving text documents
Evaluation
Query reformulation

Search Engines = IR + Link Structure + Name Interpretation

Information Retrieval vs Information Extraction


Information Retrieval
Given a set of query terms and a set of document terms, select only the most relevant documents (precision), and preferably all the relevant ones (recall).

Information Extraction
Extract from the text what the document means.

IR systems can FIND documents but need not understand them
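Precision and recall, mentioned above, can be computed once the set of relevant documents is known. The sketch below uses the standard definitions (precision is the fraction of retrieved documents that are relevant; recall is the fraction of relevant documents that are retrieved); the document ids are invented for the example.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a retrieved set against a relevant set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    # Guard against empty sets to avoid division by zero.
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical result: 4 documents retrieved, 3 relevant in total, 2 in common.
p, r = precision_recall({"d1", "d2", "d3", "d4"}, {"d2", "d4", "d7"})
print(p, r)
```

Here half of the retrieved documents are relevant (precision 0.5) and two of the three relevant documents were found (recall of two thirds).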

