
A Comparative Evaluation of Term Weighting Methods for Information Filtering

Nikolaos Nanas (Knowledge Media Institute, The Open University)
Victoria Uren (Knowledge Media Institute, The Open University)
Anne de Roeck (Department of Computing, The Open University)
Milton Keynes, MK7 6AA, U.K.

Users of information filtering systems cannot be expected to provide large amounts of information to initialize a profile. Therefore, term weighting methods for information filtering have somewhat different requirements to those for information retrieval and text categorization. We present a comparative evaluation of term weighting methods, including a new method, relative document frequency, designed specifically for information filtering. The best weighting methods appear to be those that favor information provided by the user over information from a general collection.

1. Introduction

It is generally acknowledged that it is extremely difficult for people to find interesting information within the ever increasing volume available. Research in Information Filtering (IF) tackles the problem of information overload by filtering documents through a user profile, a tailored representation of the user's interests. Many approaches to user profile representation have been proposed [5]. Typically, they use weighted terms, extracted from a set of user-specified documents.

Term weighting exploits statistical regularities in written language to assign importance weights to terms. Based on the assigned weights, the most informative terms in a set of user-specified documents can be selected to populate the initial profile, or to maintain an existing one. Most existing term weighting methods have been introduced and evaluated in the context of Information Retrieval (IR) and Text Categorization (TC). However, in IF, users are expected to specify only a small number of documents for profile initialisation. This can negatively affect the performance of existing term weighting methods.

To investigate this claim, we have conducted a comparative evaluation of term weighting methods in the context of IF. More specifically, in the next section we discuss term weighting in IR and TC, and then we introduce a term weighting method specifically devised with IF in mind (sec. 3). The methods are evaluated in section 4 using a slight modification of TREC's routing subtask. The results indicate that methods from TC, which assign more importance to information provided by the user, are more appropriate for IF than methods from IR. The introduced method exhibits satisfactory performance.

2. Term Weighting in IR and TC

The usage of words in language exhibits statistical regularities, which are also reflected in textual information. Statistical term weighting methods can exploit variations in the distribution of terms within user-specified documents about a topic and within a complete document collection to measure how specific terms are to that topic. A term's distribution can be expressed in terms of a contingency table (table 1) [17].

In IR, term weighting has been mainly concerned with the representation of the document space (automatic indexing) [7]. The corresponding methods are based only on term statistics in the complete document collection and hence they cannot explicitly measure the specificity of terms regarding the topics of interest to the user. Nevertheless, when profile terms are selected out of the terms in the user-specified documents, one implicitly takes relevance information into account. In this way, methods like Inverse Document Frequency (IDF) [16] and Residual Inverse Document Frequency (RIDF) [8] can be applied to select, for profile initialisation, terms in the user-specified documents that are specific in the complete collection.

Weighting of query terms can be accomplished if additional relevance information is available. The latter is acquired through user feedback on the documents retrieved by the query. Robertson and Sparck Jones introduced a series of four such methods [13]. We have experimented with the retrospective version of the first (F1) and the predictive version of the fourth (F4).

TC is concerned with the automatic classification of documents according to relatively static topic categories. As a consequence, for each one of the topic categories, thousands of relevant documents may be available [9]. This allows the application of machine learning algorithms to the classification task, using, for instance, decision trees, naive Bayes, nearest-neighbour, neural networks and others [15]. To facilitate the learning process, term weighting methods like Relevant Document Frequency Threshold (RDFThresh), Information Gain (IG) [2], Mutual Information (MI) [8] and Pearson's chi-square (CHI) test are usually applied to identify those terms that are more specific to each topic category.

IF has been approached both as a specialisation of IR [3] and of TC [11], with user profiles treated as queries and classifiers respectively. So typically, IF systems inherit the above methods for weighting and selecting profile terms. Users, however, have neither the time nor the inclination to specify a large number of relevant documents. A small set can be easily compiled from either saved documents or the user's bookmarks. It provides both a pool of candidate terms and the statistics for identifying, in combination with collection statistics, the most informative of them. However, the expected small number (e.g. 30) of user-specified documents distinguishes IF from IR and TC. It provides more information than a query, but less than the thousands of documents that can be available for training classifiers. In the next section, we present a term weighting method that takes this characteristic of IF into account and then compare it experimentally to the aforementioned methods on an IF problem. Similar comparative experiments on an m-ary classification problem have already been conducted [18]. Table 2 presents each method's formula using the notation of the contingency table (table 1). Note that F1 and MI have the same formula.

                          Documents
              Relevant      Non-relevant          Collection
Term +        r             n - r                 n
Term -        R - r         (N - R) - (n - r)     N - n
Total         R             N - R                 N

where
  r = number of relevant documents the term appears in
  n = number of collection documents the term appears in
  R = total number of relevant documents
  N = number of documents in the collection

Table 1. Contingency table

Method       Formula
IG           IG = sum over t' in {t, not t} and c in {rel, non-rel} of
             P(t', c) log( P(t', c) / (P(t') P(c)) ), with probabilities
             estimated from table 1 (e.g. P(t, rel) = r/N, P(t) = n/N, P(rel) = R/N)
RDFThresh    RDFThresh = r
CHI (χ²)     χ² = N ( r(N - n - R + r) - (n - r)(R - r) )² / ( n (N - n) R (N - R) )
F4           F4 = log( (r + 0.5)(N - n - R + r + 0.5) / ( (R - r + 0.5)(n - r + 0.5) ) )
F1/MI        F1/MI = log( (r/R) / (n/N) )
IDF          IDF = log₂( N/n )
RIDF         RIDF = IDF + log₂( 1 - e^(-f) ), where f is the term's average
             frequency per collection document

Table 2. Term weighting methods
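To make table 2 concrete, the following minimal sketch computes several of the listed weights directly from the contingency counts r, n, R and N. The function and variable names are ours, not the paper's; RDFThresh is omitted since it is simply r.

import math

def f1_mi(r, n, R, N):
    """F1/MI: log-ratio of the term's rate in the relevant documents (r/R)
    to its rate in the whole collection (n/N)."""
    return math.log((r / R) / (n / N))

def f4(r, n, R, N):
    """Predictive F4 weight of Robertson and Sparck Jones [13],
    with the usual 0.5 corrections."""
    return math.log(((r + 0.5) * (N - n - R + r + 0.5)) /
                    ((R - r + 0.5) * (n - r + 0.5)))

def chi_square(r, n, R, N):
    """Pearson's chi-square statistic for the term/relevance contingency table."""
    return (N * (r * (N - n - R + r) - (n - r) * (R - r)) ** 2 /
            (n * (N - n) * R * (N - R)))

def idf(n, N):
    """Inverse Document Frequency [16]."""
    return math.log2(N / n)

# Toy example (numbers ours): a term appearing in 25 of R = 30 relevant
# documents and in n = 1000 of N = 100000 collection documents.
r, n, R, N = 25, 1000, 30, 100000
print(f1_mi(r, n, R, N), f4(r, n, R, N), chi_square(r, n, R, N), idf(n, N))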

3. Relative Document Frequency

Like Edmundson and Wyllys [4], we assume that terms pertaining to a topic of interest are more likely to appear in a relevant document (r/R) than in a random document from a general collection (n/N). Relative document frequency (RelDF) measures this difference, to assign to each term a weight in the interval (-1, 1) (equation 1). This makes accurate estimations possible, even in the case of a small number of user-specified documents. RelDF does not depend on the number R of relevant documents. Although a large R provides statistical confidence, RelDF may also be applied in the case of a small R. For more details on RelDF see [10]¹.

RelDF(t) = r/R - n/N    (1)

¹ It was brought to our attention [1] that RelDF may be derived from Rocchio's algorithm if the complete collection is regarded as non-relevant. It is also discussed in [12].
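Equation 1 translates into a one-line function (the name is ours). As a worked example with made-up numbers, a term occurring in 24 of R = 30 user-specified documents and in 2000 of N = 100000 collection documents gets the weight 24/30 - 2000/100000 = 0.78.

def reldf(r, n, R, N):
    """Relative Document Frequency (eq. 1): a weight in (-1, 1) favoring
    terms frequent in the relevant documents but rare in the collection."""
    return r / R - n / N

print(reldf(24, 2000, 30, 100000))  # 0.8 - 0.02 = 0.78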
4. Experimental Evaluation

We evaluated the term weighting methods using a slight variation of the TREC-2001 routing subtask. Our goal was to comply with an existing and well-established evaluation methodology as much as possible, while at the same time taking into account the small number of user-specified documents.

4.1. Experimental Setup

The Text REtrieval Conference (TREC) has been held annually since 1992 and its purpose is to provide a standard infrastructure for the large-scale evaluation of IR systems. TREC-2001 adopts the Reuters Corpus Volume 1 (RCV1). The latter is an archive of 806,791 English language news stories that has recently been made freely available for research purposes². The stories have been manually categorized according to topic, region, and industry sector [14].

² http://about.reuters.com/researchandstandards/corpus/index.asp

The TREC-2001 filtering track is based on 84 of the 103 RCV1 topic categories. Furthermore, it divides RCV1 into 23,864 training stories and a test set comprising the rest of the stories³.

³ For more details on the TREC 2001 filtering track see: http://trec.nist.gov/data/t10_filtering/T10filter_guide.htm

According to TREC's routing task, systems are allowed to use the complete relevance information and any non-relevance-related information from the training set. Systems are evaluated on the basis of the best 1000 scoring documents, using the average uninterpolated precision (AUP) measure. The AUP is defined as the sum of the precision values at each point in the list where a relevant document appears, divided by the total number of relevant documents. The AUP produces scores with absolute values that depend on the number of relevant documents per topic.
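To make the measure concrete, the sketch below (a hypothetical helper, not from the paper) computes AUP for a ranked list given the set of relevant document ids.

def average_uninterpolated_precision(ranking, relevant):
    """AUP: sum of the precision at each rank where a relevant document
    appears, divided by the total number of relevant documents."""
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranking, start=1):  # ranking = best 1000 documents
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this cut-off
    return precision_sum / len(relevant) if relevant else 0.0

# Example: 2 of 4 relevant documents retrieved, at ranks 1 and 3.
print(average_uninterpolated_precision(["d1", "d7", "d3"],
                                       {"d1", "d3", "d9", "d2"}))
# (1/1 + 2/3) / 4 = 0.4167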
For our experiments we concentrated on the first 10 out of the 84 TREC topics (R1-R10). Furthermore, to more realistically reflect the amount of relevance information that a user is expected to provide for each topic of interest, we have performed a series of experiments using only the first 10, 20, 30 and 40 relevant documents per topic – far less than the hundreds provided for most of the topics by the training set.

To reduce the space of unique terms in the relevant documents, they were preprocessed with stop word removal and stemming using Porter's algorithm. Each of the methods was then applied to weight the remaining terms, and a topic-specific profile was constructed using the most informative terms. The number of profile terms could take one of the following values: 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 40, 60, 80 and 100.

In summary, a different topic-specific profile was constructed for each possible combination of term weighting method, topic, number of relevant documents and number of profile terms. In total, 1120 profiles were evaluated for each of the term weighting methods.
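As an illustration of this setup, the sketch below builds a topic profile from a handful of relevant documents: stop word removal and Porter stemming (here via NLTK, one possible implementation; the paper names only the algorithm), weighting of the remaining terms, and selection of the top T terms. All names are ours; weight_fn stands for any of the methods in table 2 or RelDF (eq. 1).

from collections import Counter
from nltk.stem import PorterStemmer  # Porter's stemming algorithm

STOP_WORDS = {"the", "a", "of", "and", "in", "to", "is"}  # toy stop list
stemmer = PorterStemmer()

def preprocess(text):
    """Lowercase, drop stop words, stem with Porter's algorithm."""
    return [stemmer.stem(w) for w in text.lower().split()
            if w not in STOP_WORDS]

def build_profile(relevant_docs, collection_doc_freq, N, weight_fn, T):
    """Weight every term in the relevant documents and keep the T best."""
    R = len(relevant_docs)
    rel_doc_freq = Counter()  # r per term (document frequency, not term frequency)
    for doc in relevant_docs:
        rel_doc_freq.update(set(preprocess(doc)))
    weights = {
        # a term appears in at least the relevant documents themselves, so n >= r
        t: weight_fn(r=r, n=max(collection_doc_freq.get(t, 0), r), R=R, N=N)
        for t, r in rel_doc_freq.items()
    }
    top = sorted(weights, key=weights.get, reverse=True)[:T]
    return {t: weights[t] for t in top}

# e.g. profile = build_profile(docs, df, N=806791, weight_fn=reldf, T=20)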
The profiles were then used to assess the relevance of the documents in the test set. Independence between profile terms and binary indexing of documents were assumed to keep the evaluation methodology simple. For each profile p and document d, two different evaluation functions were adopted. In the first case, documents were evaluated according to the inner product measure [6]. A document's relevance S(d) was calculated as S(d) = Σᵢ w_p(tᵢ) · w_d(tᵢ), where w_p(t) and w_d(t) are respectively the weights of a term t in the profile and in the document, and the sum runs over the terms of d. Since binary indexing of documents was assumed, the previous equation can be simplified to equation 2. In that sense, a document's relevance is calculated as the sum of the weights of the profile terms that it contains.

S(d) = Σ_{t ∈ p∩d} w_p(t)    (2)
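A direct rendering of equation 2 (names ours): with binary document indexing, scoring reduces to summing the profile weights of the profile terms present in the document.

def score_sum(profile, doc_terms):
    """Eq. 2: relevance as the sum of the weights of the profile terms
    that appear in the document (binary document indexing)."""
    return sum(w for t, w in profile.items() if t in doc_terms)

print(score_sum({"oil": 0.8, "price": 0.5, "opec": 0.9},
                {"oil", "opec", "market"}))  # 0.8 + 0.9 = 1.7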
We have also experimented with evaluating a document by the product of the weights of the profile terms that it contains. In this case a document's relevance S(d) was calculated by equation 3. We use this relatively ad hoc approach as an additional way of uniformly comparing the term weighting methods. The product of term weights, however, is monotonic in the number of terms only if the weights are greater than one. Consequently, the weights of profile terms have been scaled so that no weight is less than one. The drawback of the multiplication approach is that it can overestimate the relevance of a document containing many profile terms, even if these terms are not the most informative terms in the profile.

S(d) = ∏_{t ∈ p∩d} w_p(t)    (3)
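Equation 3 in the same style, including the scaling step the text describes. The shift-by-constant scaling shown here is only one plausible reading, since the paper states just that no weight may fall below one; the names are ours.

def scale_weights(profile):
    """Shift all weights so the minimum is 1, keeping the product
    monotonic in the number of matched terms (one possible scaling)."""
    shift = 1.0 - min(profile.values())
    return {t: w + shift for t, w in profile.items()}

def score_product(profile, doc_terms):
    """Eq. 3: relevance as the product of the matching profile weights.
    Returns 1.0 when no profile term occurs in the document."""
    score = 1.0
    for t, w in scale_weights(profile).items():
        if t in doc_terms:
            score *= w
    return score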

4.2. Results

Each profile was used to evaluate the documents in the test set. Figures 1 and 2 present the results for summation and multiplication of weights, respectively. In these graphs, the x-axis corresponds to the number of profile terms and the y-axis to the average AUP score over all topics (R1-R10) and numbers of relevant documents (10, 20, 30, 40). A different line has been plotted for each term weighting method. In table 3, each method's score for different numbers of profile terms has been averaged to a single overall score value. Finally, in the case of summation of weights, table 4 presents for each weighting method the per-topic average over different numbers of profile terms and relevant documents. The corresponding results for multiplication of weights have not been included due to limited space.

Figure 1. Results for summation of weights (eq. 2). [Average AUP against number of profile terms; one curve per weighting method.]

Figure 2. Results for multiplication of weights (eq. 3). [Average AUP against number of profile terms; one curve per weighting method.]

The results reveal a significant difference in the performance levels of IG, RelDF, RDFThresh and CHI in comparison to F4, F1/MI, IDF and RIDF. In other words, methods from TC appear to perform better than methods from IR. These first four methods are those biased towards the information provided by the user. They favor terms that appear in many relevant documents over those appearing in only a few. In contrast, the smoothing effect of the logarithm, in combination with the small number of user-specified documents and the substantially larger number of documents in the collection, biases F4 and F1/MI towards information acquired from the collection. Large differences in the document frequency of terms are more strongly taken into account than small differences in their relevant document frequency. This negative effect of logarithmic smoothing is evident in the difference between the performance of RelDF and F1/MI. Although both methods use the same statistics, the application of logarithms results in reduced performance for F1/MI. The importance of the user-specified information is also highlighted by the poor performance of IDF and RIDF, which do not take into account the relevant document frequency of terms. However, the information provided by the user is not sufficient on its own for superior performance. Despite the fact that RDFThresh performs substantially better than IDF and RIDF, RelDF performs even better. The difference in their performance is apparently due to the collection statistic that RelDF takes into account (second fraction of equation 1).

                  Evaluation Function
Method        Sum (eq. 2)    Product (eq. 3)
IG            0.07346        0.04926
RelDF         0.05629        0.03366
RDFThresh     0.04392        0.02707
CHI           0.02574        0.0249
F4            0.00482        0.00346
F1/MI         0.00352        0.00285
IDF           0.0024         0.00197
RIDF          0.00186        0.00168

Table 3. Overall Score

Table 3 presents the evaluated methods in decreasing order of overall score. IG is the best performing approach, while RelDF represents a promising alternative. Despite its simplicity, the competitive performance of RDFThresh is not surprising: RDFThresh takes into account the important user-provided information. In addition, its results are analogous to those presented by [18] for its m-ary counterpart, document frequency. It is the performance of CHI that is relatively unexpected: CHI is the worst of the four methods from TC. This is possibly due to CHI's m-ary nature. While in an m-ary classification problem it is usually safe to treat documents not pertaining to a certain topic as non-relevant to that topic, we have already noted that in our case not all of the documents in the training set that pertain to a certain topic are used for the construction of the corresponding profile.

Out of the methods from IR, F4 is the best performing one. Its superior performance over F1 confirms Robertson and Sparck Jones' results [13]. IDF and RIDF are the worst performing approaches, probably because they do not explicitly exploit information supplied by the user. It is, however, interesting to note that RIDF performs slightly better than the rest of the IR methods for small numbers of extracted terms. This characteristic of RIDF can be attributed to its Poisson distribution component, which takes into account the frequency of occurrence of terms in the user-specified documents. As a result, the user-provided information influences the weighting of terms to some extent.
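A sketch of RIDF in its standard residual-IDF form [8] (names ours): the Poisson term predicts the document frequency a term would have if its occurrences were scattered randomly over the collection, and the residual rewards terms that are more concentrated than that prediction.

import math

def ridf(doc_freq, coll_freq, N):
    """Residual IDF: observed IDF minus the IDF predicted by a Poisson
    model, assuming the standard form from Manning & Schutze [8]."""
    idf = math.log2(N / doc_freq)
    lam = coll_freq / N  # expected number of occurrences per document
    # Under Poisson, P(document contains the term) = 1 - e^(-lam)
    return idf + math.log2(1.0 - math.exp(-lam))

# Toy example: 400 occurrences spread over 50 of 100000 documents.
print(ridf(doc_freq=50, coll_freq=400, N=100000))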
According to table 4, IG achieves the best score for 6 of the 10 topics and RelDF for the remaining 4. Significant differences in score between methods from TC and IR are observed in all 10 topics.

The results using multiplication of weights (fig. 2) are, as expected, worse than those using summation (fig. 1). Nevertheless, in both cases the behavior of the evaluated methods is analogous, both in terms of relative performance and in terms of performance trend. Therefore, both document evaluation approaches confirm the above findings. This is expected, since the assignment of weights is done irrespective of the adopted document evaluation approach.

Topic   IG         RelDF      RDFThresh   CHI        F4         F1/MI      IDF        RIDF
R1      0.001119   0.00135    0.00109     0.00011    0.00016    0.00016    0.00017    0.00018
R2      0.046944   0.04793    0.03687     0.01139    0.00577    0.0045     0.00247    0.00009
R3      0.001420   0.00101    0.00043     0.0002     0.00024    0.00023    0.00023    0.00004
R4      0.028700   0.01666    0.00964     0.00016    0.00028    0.00043    0.00047    0.00002
R5      0.022331   0.01118    0.01373     0.01189    0.00381    0.00306    0.0011     0.00564
R6      0.196137   0.07312    0.02395     0.12679    0.00889    0.00493    0.00297    0.00002
R7      0.025712   0.02662    0.02006     0.00099    0.00183    0.00141    0.00099    0.00008
R8      0.066312   0.06509    0.06123     0.02735    0.00703    0.00723    0.00617    0.01108
R9      0.155927   0.12031    0.08278     0.05222    0.00364    0.0018     0.00065    0.00004
R10     0.141035   0.16219    0.1602      0.0092     0.01334    0.00917    0.00722    0.00024

Table 4. Per-topic average for summation of weights (eq. 2)

5. Summary and Conclusions

In this paper, we have presented an evaluation of a large number of term weighting methods for the problem of identifying informative terms within a small number of user-specified documents about a topic of interest. The extracted terms can be used either to populate a user's initial profile or to update an existing one. Existing methods that have been introduced in the context of IR and TC, and a new term weighting approach (RelDF), have been evaluated on an appropriate modification of the TREC-2001 routing subtask.

The results indicate that methods from TC are more appropriate for IF than methods from IR. These methods favor information provided by the user over information from the document collection. IG is the best performing approach, while RelDF appears to be a promising alternative. The results can be used as evidence for the appropriate choice of a term weighting method by IF systems. In addition, the easy reproduction of the experimental setup and the basic document evaluation functions that have been adopted allow the use of the results as a baseline for comparisons with more elaborate IF approaches.

References

[1] A. Arampatzis. Personal communication, 2003.
[2] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. Wadsworth International Group, Belmont, CA, 1984.
[3] J. Callan. Learning while filtering documents. In 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 224–231, 1998.
[4] H. P. Edmundson and R. E. Wyllys. Automatic abstracting and indexing – survey and recommendations. Communications of the ACM, 4(5):226–234, 1961.
[5] U. Hanani, B. Shapira, and P. Shoval. Information filtering: Overview of issues, research and systems. User Modeling and User-Adapted Interaction, 11:203–259, 2001.
[6] W. P. Jones and G. W. Furnas. Pictures of relevance: A geometric analysis of similarity measures. Journal of the American Society of Information Science, 38(6):420–442, May 1986.
[7] K. Kageura and B. Umino. Method of automatic term recognition. Terminology, 3(2):259–290, 1996.
[8] C. D. Manning and H. Schutze. Foundations of Statistical Natural Language Processing. MIT Press, 1999.
[9] M. Moens and J. Dumortier. Text categorization: the assignment of subject descriptors to magazine articles. Information Processing and Management, 36(6):841–861, 2000.
[10] N. Nanas, V. Uren, A. de Roeck, and J. Domingue. Building and applying a concept hierarchy representation of a user profile. In 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 198–204. ACM Press, 2003.
[11] M. J. Pazzani. Representation of electronic mail filtering profiles: A user study. In International Conference on Intelligent User Interfaces, New Orleans, LA, USA, 2000.
[12] M. F. Porter. Implementing a probabilistic information retrieval system. Information Technology: Research and Development, 1:131–156, 1982.
[13] S. E. Robertson and K. Sparck Jones. Relevance weighting of search terms. Journal of the American Society for Information Science, 27:129–146, 1976.
[14] T. Rose, M. Stevenson, and M. Whitehead. The Reuters Corpus Volume 1 – from yesterday's news to tomorrow's language resources. In 3rd International Conference on Language Resources and Evaluation, 2002.
[15] F. Sebastiani. Machine learning in automated text categorization. ACM Computing Surveys, 34(1), 2002.
[16] K. Sparck Jones. A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation, 28(1):11–20, 1972.
[17] C. J. van Rijsbergen. Information Retrieval. Butterworths, London, 2nd edition, 1979.
[18] Y. Yang and J. O. Pedersen. A comparative study on feature selection in text categorization. In 14th International Conference on Machine Learning (ICML '97), 1997.

Proceedings of the 15th International Workshop on Database and Expert Systems Applications (DEXA’04)
1529-4188/04 $ 20.00 IEEE
