
MODELS IN INFORMATION RETRIEVAL

Catur Supriyanto
catur.supriyanto@dsn.dinus.ac.id

Fakultas Ilmu Komputer


Universitas Dian Nuswantoro, Semarang
Outline
• IR Model
• Boolean Model
• Vector Space Model
• General Process in VSM
• Preprocessing
• Tokenization
• Stemming
• Stopword Removal
• Term Weighting
• Term Frequency-Inverse Document Frequency
Models in Information Retrieval
1. Boolean Model
2. Vector Space Model
Boolean Model
• Each document is represented by a set of keywords.
• A query consists of keywords connected by Boolean operators
such as AND, OR, and NOT.
Example of Boolean Model

Keyword-document incidence matrix (1 = the keyword occurs in the play):

            Antony and  Julius  The      Hamlet  Othello  Macbeth
            Cleopatra   Caesar  Tempest
Antony          1          1       0        0       0        1
Brutus          1          1       0        1       0        0
Caesar          1          1       0        1       1        1
Calpurnia       0          1       0        0       0        0
Cleopatra       1          0       0        0       0        0
Mercy           1          0       1        1       1        1
Worser          1          0       1        1       1        0

Query: Brutus AND Caesar AND NOT Calpurnia

Relevant documents: Antony and Cleopatra, Hamlet
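The query above can be answered with simple set operations. This is a minimal sketch (not a full inverted-index implementation): each keyword maps to the set of plays that contain it, taken directly from the incidence matrix, and AND/NOT become set intersection and difference.

```python
# Posting sets: each keyword -> the set of documents containing it,
# read off the incidence matrix row by row.
postings = {
    "Antony":    {"Antony and Cleopatra", "Julius Caesar", "Macbeth"},
    "Brutus":    {"Antony and Cleopatra", "Julius Caesar", "Hamlet"},
    "Caesar":    {"Antony and Cleopatra", "Julius Caesar", "Hamlet",
                  "Othello", "Macbeth"},
    "Calpurnia": {"Julius Caesar"},
}

all_docs = {"Antony and Cleopatra", "Julius Caesar", "The Tempest",
            "Hamlet", "Othello", "Macbeth"}

# Query: Brutus AND Caesar AND NOT Calpurnia
# AND -> set intersection (&), NOT -> complement against all documents.
result = postings["Brutus"] & postings["Caesar"] & (all_docs - postings["Calpurnia"])
print(sorted(result))  # ['Antony and Cleopatra', 'Hamlet']
```

Real systems store postings as sorted lists and merge them, but the set algebra is the same.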
Vector Space Model
• Documents are represented as vectors in a term-document matrix.
• Each cell of the matrix contains the “weight” of a term in a document.

            Doc1   Doc2   Doc3
car           27      4     24
auto           3     33      0
insurance      0     33     29
best          14      0     17

Term Frequency Matrix
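With documents as vectors, similarity between two documents (or a query and a document) is typically measured by the cosine of the angle between their vectors. A minimal sketch using the term-frequency matrix above (term order: car, auto, insurance, best):

```python
import math

# Column vectors of the term-frequency matrix (terms: car, auto, insurance, best).
docs = {
    "Doc1": [27, 3, 0, 14],
    "Doc2": [4, 33, 33, 0],
    "Doc3": [24, 0, 29, 17],
}

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of vector norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

sim = cosine(docs["Doc1"], docs["Doc3"])
```

A document is always maximally similar to itself (cosine 1.0), and documents sharing no terms get cosine 0.0.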


General Process in VSM

Raw Text Document → Preprocessing → Term Weighting → Feature Reduction → Classification/Clustering

Preprocessing steps:
1. Tokenization
2. N-Gram
3. Stemming
4. Stopword Removal

Term weighting schemes:
1. Term Frequency
2. TF-IDF (Term Frequency-Inverse Document Frequency)
Tokenization

• Tokenization is the process of chopping character streams into tokens.

Input:  Friends, Romans, Countrymen, lend me your ears;
Output: Friends Romans Countrymen lend me your ears
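A minimal tokenizer reproducing the slide's example — splitting on runs of non-letter characters. (Real tokenizers also handle numbers, hyphenation, apostrophes, and language-specific rules.)

```python
import re

def tokenize(text):
    # Split on any run of non-letter characters; drop empty strings
    # produced by leading/trailing punctuation.
    return [tok for tok in re.split(r"[^A-Za-z]+", text) if tok]

tokens = tokenize("Friends, Romans, Countrymen, lend me your ears;")
print(tokens)
# ['Friends', 'Romans', 'Countrymen', 'lend', 'me', 'your', 'ears']
```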
Stopword Removal
• Stopword removal removes words on a stop list from the text.
The stop list contains words that carry little meaning on their
own, such as “to be”, “and”, “or”, etc.
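In code this is a simple filter against a set. The stop list below is a tiny hypothetical one for illustration; real systems use larger curated lists (e.g. those shipped with NLTK or scikit-learn).

```python
# Hypothetical minimal stop list; production lists contain hundreds of words.
STOP_LIST = {"and", "or", "to", "be", "the", "a", "is"}

def remove_stopwords(tokens):
    # Keep only tokens that are not in the stop list (case-insensitive).
    return [t for t in tokens if t.lower() not in STOP_LIST]

print(remove_stopwords(["to", "be", "or", "not", "to", "be"]))  # ['not']
```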
Stemming
• Stemming strips affixes from the beginning or the end of a
word to reduce it to its basic root form.

car, cars, car’s, cars’ ⇒ car
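A toy suffix-stripping stemmer that covers exactly the slide's example; it is a sketch only. Production systems use full algorithms such as Porter's stemmer (available as `nltk.stem.PorterStemmer`).

```python
def stem(word):
    # Try the longest suffixes first; keep at least two characters of stem.
    for suffix in ("s'", "'s", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return word[: -len(suffix)]
    return word

print([stem(w) for w in ["car", "cars", "car's", "cars'"]])
# ['car', 'car', 'car', 'car']
```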


Term Weighting
• TF-IDF is the most famous term weighting scheme.

w(t,d) = tf(t,d) × idf(t)

idf(t) = log( N / df(t) )

where
  tf(t,d) is the frequency of term t in document d
  N       is the total number of documents
  df(t)   is the document frequency: the number of documents in which term t occurs
Examples for idf

• tf measures how frequently a term occurs in a document.
• idf measures how important (discriminative) a term is.

Suppose N = 1,000,000 documents and idf(t) = log10( N / df(t) ):

term          df(t)      idf(t)
calpurnia             1      6
animal              100      4
sunday            1,000      3
fly              10,000      2
under           100,000      1
the           1,000,000      0
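The idf column above can be reproduced directly from the formula; the tf-idf weight of a term in a document is then the product of its tf and idf. A short sketch:

```python
import math

# N = 1,000,000 documents; idf(t) = log10(N / df(t)), as in the table above.
N = 1_000_000
df = {"calpurnia": 1, "animal": 100, "sunday": 1_000,
      "fly": 10_000, "under": 100_000, "the": 1_000_000}

idf = {term: math.log10(N / d) for term, d in df.items()}
# Rare terms get high idf ("calpurnia" -> 6.0); a term occurring in
# every document gets idf 0 ("the" -> 0.0) and is effectively ignored.

def tfidf(tf, term):
    """Weight of a term in a document: tf * idf."""
    return tf * idf[term]
```

For example, a document containing "animal" 3 times gives that term a weight of 3 × 4 = 12.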
TFIDF Matrix
Problem:
• The term frequency (TF) in a document is obviously more precise and
  reasonable than a binary value.
• Key words often appear frequently in a document and should be assigned
  greater weights than rare words.
• However, TF may assign large weights to common words with weak
  text-discriminating power.
• TF-IDF does not use the known class information of the training text
  while weighting a term, so the computed weight cannot fully reflect
  the term's importance in text classification.
• The traditional TF-IDF (term frequency and inverse document frequency)
  is not fully effective for text classification.
• Supervised term weighting schemes consider only the term distribution
  in the two classes of positive and negative text.

Current Methods (Supervised Term Weighting, STW, schemes):
1. TF-CHI (Term Frequency - Chi Square)
2. TF-IG (Term Frequency - Information Gain)
3. TF-GR (Term Frequency - Gain Ratio)
4. TF-RF (Term Frequency - Relevance Frequency)
5. TF-Prob
6. TF-ICF
7. TF-IDF-ICF
8. TF-IDF-ICSDF

Proposed Methods:
1. TF-IGM (Term Frequency - Inverse Gravity Moment)
2. RTF-IGM (Root Term Frequency - Inverse Gravity Moment)

Results:
Extensive experiments on public benchmark datasets prove that TF-IGM is
consistently superior to the famous TF-IDF and to the state-of-the-art
supervised term weighting schemes.
