
Machine Learning in Automated Text Categorization

FABRIZIO SEBASTIANI
Consiglio Nazionale delle Ricerche, Italy

The automated categorization (or classification) of texts into predefined categories has
witnessed a booming interest in the last 10 years, due to the increased availability of
documents in digital form and the ensuing need to organize them. In the research
community the dominant approach to this problem is based on machine learning
techniques: a general inductive process automatically builds a classifier by learning,
from a set of preclassified documents, the characteristics of the categories. The
advantages of this approach over the knowledge engineering approach (the manual
definition of a classifier by domain experts) are very good effectiveness,
considerable savings in terms of expert labor power, and straightforward portability to
different domains. This survey discusses the main approaches to text categorization
that fall within the machine learning paradigm. We will discuss in detail issues
pertaining to three different problems, namely, document representation, classifier
construction, and classifier evaluation.

Categories and Subject Descriptors: H.3.1 [Information Storage and Retrieval]:
Content Analysis and Indexing - Indexing methods; H.3.3 [Information Storage and
Retrieval]: Information Search and Retrieval - Information filtering; H.3.4
[Information Storage and Retrieval]: Systems and Software - Performance
evaluation (efficiency and effectiveness); I.2.6 [Artificial Intelligence]: Learning -
Induction

General Terms: Algorithms, Experimentation, Theory

Additional Key Words and Phrases: Machine learning, text categorization, text
classification

1. INTRODUCTION

In the last 10 years content-based document management tasks (collectively known as
information retrieval, IR) have gained a prominent status in the information systems
field, due to the increased availability of documents in digital form and the ensuing
need to access them in flexible ways. Text categorization (TC, a.k.a. text
classification, or topic spotting), the activity of labeling natural language texts
with thematic categories from a predefined set, is one such task. TC dates back to the
early '60s, but only in the early '90s did it become a major subfield of the
information systems discipline, thanks to increased applicative interest and to the
availability of more powerful hardware. TC is now being applied in many contexts,
ranging from document indexing based on a controlled vocabulary, to document
filtering, automated metadata generation, word sense disambiguation, population of

Author's address: Istituto di Elaborazione dell'Informazione, Consiglio Nazionale delle Ricerche, Via G.
Moruzzi 1, 56124 Pisa, Italy; e-mail: fabrizio@iei.pi.cnr.it.
Permission to make digital/hard copy of part or all of this work for personal or classroom use is granted
without fee provided that the copies are not made or distributed for profit or commercial advantage, the
copyright notice, the title of the publication, and its date appear, and notice is given that copying is by
permission of the ACM, Inc. To copy otherwise, to republish, to post on servers, or to redistribute to lists,
requires prior specific permission and/or a fee.
© 2002 ACM 0360-0300/02/0300-0001 $5.00

ACM Computing Surveys, Vol. 34, No. 1, March 2002, pp. 1-47.
hierarchical catalogues of Web resources, and in general any application requiring
document organization or selective and adaptive document dispatching.

Until the late '80s the most popular approach to TC, at least in the "operational"
(i.e., real-world applications) community, was a knowledge engineering (KE) one,
consisting in manually defining a set of rules encoding expert knowledge on how to
classify documents under the given categories. In the '90s this approach has
increasingly lost popularity (especially in the research community) in favor of the
machine learning (ML) paradigm, according to which a general inductive process
automatically builds an automatic text classifier by learning, from a set of
preclassified documents, the characteristics of the categories of interest. The
advantages of this approach are an accuracy comparable to that achieved by human
experts, and a considerable savings in terms of expert labor power, since no
intervention from either knowledge engineers or domain experts is needed for the
construction of the classifier or for its porting to a different set of categories.
It is the ML approach to TC that this paper concentrates on.

Current-day TC is thus a discipline at the crossroads of ML and IR, and as such it
shares a number of characteristics with other tasks such as information/knowledge
extraction from texts and text mining [Knight 1999; Pazienza 1997]. There is still
considerable debate on where the exact border between these disciplines lies, and the
terminology is still evolving. "Text mining" is increasingly being used to denote all
the tasks that, by analyzing large quantities of text and detecting usage patterns,
try to extract probably useful (although only probably correct) information.
According to this view, TC is an instance of text mining. TC enjoys quite a rich
literature now, but this is still fairly scattered.[1] Although two international
journals have devoted special issues to this topic [Joachims and Sebastiani 2002;
Lewis and Hayes 1994], there are no systematic treatments of the subject: there are
neither textbooks nor journals entirely devoted to TC yet, and Manning and Schütze
[1999, Chapter 16] is the only chapter-length treatment of the subject. As a note, we
should warn the reader that the term "automatic text classification" has sometimes
been used in the literature to mean things quite different from the ones discussed
here. Aside from (i) the automatic assignment of documents to a predefined set of
categories, which is the main topic of this paper, the term has also been used to
mean (ii) the automatic identification of such a set of categories (e.g., Borko and
Bernick [1963]), or (iii) the automatic identification of such a set of categories
and the grouping of documents under them (e.g., Merkl [1998]), a task usually called
text clustering, or (iv) any activity of placing text items into groups, a task that
has thus both TC and text clustering as particular instances [Manning and Schütze
1999].

[1] A fully searchable bibliography on TC created and maintained by this author is
available at http://liinwww.ira.uka.de/bibliography/Ai/automated.text.categorization.html.

This paper is organized as follows. In Section 2 we formally define TC and its
various subcases, and in Section 3 we review its most important applications.
Section 4 describes the main ideas underlying the ML approach to classification. Our
discussion of text classification starts in Section 5 by introducing text indexing,
that is, the transformation of textual documents into a form that can be interpreted
by a classifier-building algorithm and by the classifier eventually built by it.
Section 6 tackles the inductive construction of a text classifier from a training set
of preclassified documents. Section 7 discusses the evaluation of text classifiers.
Section 8 concludes, discussing open issues and possible avenues of further research
for TC.

2. TEXT CATEGORIZATION

2.1. A Definition of Text Categorization

Text categorization is the task of assigning a Boolean value to each pair
$\langle d_j, c_i \rangle \in D \times C$, where $D$ is a domain of documents and


$C = \{c_1, \ldots, c_{|C|}\}$ is a set of predefined categories. A value of $T$
assigned to $\langle d_j, c_i \rangle$ indicates a decision to file $d_j$ under $c_i$,
while a value of $F$ indicates a decision not to file $d_j$ under $c_i$. More
formally, the task is to approximate the unknown target function
$\breve{\Phi} : D \times C \to \{T, F\}$ (that describes how documents ought to be
classified) by means of a function $\Phi : D \times C \to \{T, F\}$ called the
classifier (aka rule, or hypothesis, or model) such that $\breve{\Phi}$ and $\Phi$
"coincide as much as possible." How to precisely define and measure this coincidence
(called effectiveness) will be discussed in Section 7.1. From now on we will assume
that:

- The categories are just symbolic labels, and no additional knowledge (of a
  procedural or declarative nature) of their meaning is available.
- No exogenous knowledge (i.e., data provided for classification purposes by an
  external source) is available; therefore, classification must be accomplished on
  the basis of endogenous knowledge only (i.e., knowledge extracted from the
  documents). In particular, this means that metadata such as, for example,
  publication date, document type, publication source, etc., is not assumed to be
  available.

The TC methods we will discuss are thus completely general, and do not depend on the
availability of special-purpose resources that might be unavailable or costly to
develop. Of course, these assumptions need not be verified in operational settings,
where it is legitimate to use any source of information that might be available or
deemed worth developing [Díaz Esteban et al. 1998; Junker and Abecker 1997]. Relying
only on endogenous knowledge means classifying a document based solely on its
semantics, and given that the semantics of a document is a subjective notion, it
follows that the membership of a document in a category (pretty much as the relevance
of a document to an information need in IR [Saracevic 1975]) cannot be decided
deterministically. This is exemplified by the phenomenon of inter-indexer
inconsistency [Cleverdon 1984]: when two human experts decide whether to classify
document $d_j$ under category $c_i$, they may disagree, and this in fact happens with
relatively high frequency. A news article on Clinton attending Dizzy Gillespie's
funeral could be filed under Politics, or under Jazz, or under both, or even under
neither, depending on the subjective judgment of the expert.

2.2. Single-Label Versus Multilabel Text Categorization

Different constraints may be enforced on the TC task, depending on the application.
For instance we might need that, for a given integer $k$, exactly $k$ (or $\le k$, or
$\ge k$) elements of $C$ be assigned to each $d_j \in D$. The case in which exactly
one category must be assigned to each $d_j \in D$ is often called the single-label
(a.k.a. nonoverlapping categories) case, while the case in which any number of
categories from 0 to $|C|$ may be assigned to the same $d_j \in D$ is dubbed the
multilabel (aka overlapping categories) case. A special case of single-label TC is
binary TC, in which each $d_j \in D$ must be assigned either to category $c_i$ or to
its complement $\bar{c}_i$.

From a theoretical point of view, the binary case (hence, the single-label case, too)
is more general than the multilabel, since an algorithm for binary classification can
also be used for multilabel classification: one needs only transform the problem of
multilabel classification under $\{c_1, \ldots, c_{|C|}\}$ into $|C|$ independent
problems of binary classification under $\{c_i, \bar{c}_i\}$, for
$i = 1, \ldots, |C|$. However, this requires that categories be stochastically
independent of each other, that is, for any $c'$, $c''$, the value of
$\breve{\Phi}(d_j, c')$ does not depend on the value of $\breve{\Phi}(d_j, c'')$ and
vice versa; this is usually assumed to be the case (applications in which this is not
the case are discussed in Section 3.5). The converse is not true: an algorithm for
multilabel classification cannot be used for either binary or single-label
classification. In fact, given a document $d_j$ to classify, (i) the classifier might
attribute $k > 1$ categories to $d_j$, and it might not be obvious how to


choose a "most appropriate" category from them; or (ii) the classifier might
attribute to $d_j$ no category at all, and it might not be obvious how to choose a
"least inappropriate" category from $C$.

In the rest of the paper, unless explicitly mentioned, we will deal with the binary
case. There are various reasons for this:

- The binary case is important in itself because important TC applications,
  including filtering (see Section 3.3), consist of binary classification problems
  (e.g., deciding whether $d_j$ is about Jazz or not). In TC, most binary
  classification problems feature unevenly populated categories (e.g., much fewer
  documents are about Jazz than are not) and unevenly characterized categories
  (e.g., what is about Jazz can be characterized much better than what is not).
- Solving the binary case also means solving the multilabel case, which is also
  representative of important TC applications, including automated indexing for
  Boolean systems (see Section 3.1).
- Most of the TC literature is couched in terms of the binary case.
- Most techniques for binary classification are just special cases of existing
  techniques for the single-label case, and are simpler to illustrate than these
  latter.

This ultimately means that we will view classification under
$C = \{c_1, \ldots, c_{|C|}\}$ as consisting of $|C|$ independent problems of
classifying the documents in $D$ under a given category $c_i$, for
$i = 1, \ldots, |C|$. A classifier for $c_i$ is then a function
$\Phi_i : D \to \{T, F\}$ that approximates an unknown target function
$\breve{\Phi}_i : D \to \{T, F\}$.

2.3. Category-Pivoted Versus Document-Pivoted Text Categorization

There are two different ways of using a text classifier. Given $d_j \in D$, we might
want to find all the $c_i \in C$ under which it should be filed (document-pivoted
categorization, DPC); alternatively, given $c_i \in C$, we might want to find all the
$d_j \in D$ that should be filed under it (category-pivoted categorization, CPC).
This distinction is more pragmatic than conceptual, but is important since the sets
$C$ and $D$ might not be available in their entirety right from the start. It is also
relevant to the choice of the classifier-building method, as some of these methods
(see Section 6.9) allow the construction of classifiers with a definite slant toward
one or the other style.

DPC is thus suitable when documents become available at different moments in time,
e.g., in filtering e-mail. CPC is instead suitable when (i) a new category
$c_{|C|+1}$ may be added to an existing set $C = \{c_1, \ldots, c_{|C|}\}$ after a
number of documents have already been classified under $C$, and (ii) these documents
need to be reconsidered for classification under $c_{|C|+1}$ (e.g., Larkey [1999]).
DPC is used more often than CPC, as the former situation is more common than the
latter.

Although some specific techniques apply to one style and not to the other (e.g., the
proportional thresholding method discussed in Section 6.1 applies only to CPC), this
is more the exception than the rule: most of the techniques we will discuss allow the
construction of classifiers capable of working in either mode.

2.4. "Hard" Categorization Versus Ranking Categorization

While a complete automation of the TC task requires a $T$ or $F$ decision for each
pair $\langle d_j, c_i \rangle$, a partial automation of this process might have
different requirements.

For instance, given $d_j \in D$ a system might simply rank the categories in
$C = \{c_1, \ldots, c_{|C|}\}$ according to their estimated appropriateness to $d_j$,
without taking any "hard" decision on any of them. Such a ranked list would be of
great help to a human expert in charge of taking the final categorization decision,
since she could thus restrict the choice to the category (or categories) at the top
of the list, rather than having to examine the entire set. Alternatively, given
$c_i \in C$ a system might simply rank the documents in $D$ according to their
estimated appropriateness to $c_i$; symmetrically, for


classification under $c_i$ a human expert would just examine the top-ranked documents
instead of the entire document set. These two modalities are sometimes called
category-ranking TC and document-ranking TC [Yang 1999], respectively, and are the
obvious counterparts of DPC and CPC.

Semiautomated, "interactive" classification systems [Larkey and Croft 1996] are
useful especially in critical applications in which the effectiveness of a fully
automated system may be expected to be significantly lower than that of a human
expert. This may be the case when the quality of the training data (see Section 4) is
low, or when the training documents cannot be trusted to be a representative sample
of the unseen documents that are to come, so that the results of a completely
automatic classifier could not be trusted completely.

In the rest of the paper, unless explicitly mentioned, we will deal with "hard"
classification; however, many of the algorithms we will discuss naturally lend
themselves to ranking TC too (more details on this in Section 6.1).

3. APPLICATIONS OF TEXT CATEGORIZATION

TC goes back to Maron's [1961] seminal work on probabilistic text classification.
Since then, it has been used for a number of different applications, of which we here
briefly review the most important ones. Note that the borders between the different
classes of applications listed here are fuzzy and somehow artificial, and some of
these may be considered special cases of others. Other applications we do not
explicitly discuss are speech categorization by means of a combination of speech
recognition and TC [Myers et al. 2000; Schapire and Singer 2000], multimedia document
categorization through the analysis of textual captions [Sable and Hatzivassiloglou
2000], author identification for literary texts of unknown or disputed authorship
[Forsyth 1999], language identification for texts of unknown language [Cavnar and
Trenkle 1994], automated identification of text genre [Kessler et al. 1997], and
automated essay grading [Larkey 1998].

3.1. Automatic Indexing for Boolean Information Retrieval Systems

The application that has spawned most of the early research in the field [Borko and
Bernick 1963; Field 1975; Gray and Harley 1971; Heaps 1973; Maron 1961] is that of
automatic document indexing for IR systems relying on a controlled dictionary, the
most prominent example of which is Boolean systems. In these latter each document is
assigned one or more key words or key phrases describing its content, where these key
words and key phrases belong to a finite set called controlled dictionary, often
consisting of a thematic hierarchical thesaurus (e.g., the NASA thesaurus for the
aerospace discipline, or the MESH thesaurus for medicine). Usually, this assignment
is done by trained human indexers, and is thus a costly activity.

If the entries in the controlled vocabulary are viewed as categories, text indexing
is an instance of TC, and may thus be addressed by the automatic techniques described
in this paper. Recalling Section 2.2, note that this application may typically
require that $k_1 \le x \le k_2$ key words are assigned to each document, for given
$k_1$, $k_2$. Document-pivoted TC is probably the best option, so that new documents
may be classified as they become available. Various text classifiers explicitly
conceived for document indexing have been described in the literature; see, for
example, Fuhr and Knorz [1984], Robertson and Harding [1984], and Tzeras and Hartmann
[1993].

Automatic indexing with controlled dictionaries is closely related to automated
metadata generation. In digital libraries, one is usually interested in tagging
documents by metadata that describes them under a variety of aspects (e.g., creation
date, document type or format, availability, etc.). Some of this metadata is
thematic, that is, its role is to describe the semantics of the document by means of


bibliographic codes, key words or key phrases. The generation of this metadata may
thus be viewed as a problem of document indexing with controlled dictionary, and thus
tackled by means of TC techniques.

3.2. Document Organization

Indexing with a controlled vocabulary is an instance of the general problem of
document base organization. In general, many other issues pertaining to document
organization and filing, be it for purposes of personal organization or structuring
of a corporate document base, may be addressed by TC techniques. For instance, at the
offices of a newspaper incoming "classified" ads must be, prior to publication,
categorized under categories such as Personals, Cars for Sale, Real Estate, etc.
Newspapers dealing with a high volume of classified ads would benefit from an
automatic system that chooses the most suitable category for a given ad. Other
possible applications are the organization of patents into categories for making
their search easier [Larkey 1999], the automatic filing of newspaper articles under
the appropriate sections (e.g., Politics, Home News, Lifestyles, etc.), or the
automatic grouping of conference papers into sessions.

3.3. Text Filtering

Text filtering is the activity of classifying a stream of incoming documents
dispatched in an asynchronous way by an information producer to an information
consumer [Belkin and Croft 1992]. A typical case is a newsfeed, where the producer is
a news agency and the consumer is a newspaper [Hayes et al. 1990]. In this case, the
filtering system should block the delivery of the documents the consumer is likely
not interested in (e.g., all news not concerning sports, in the case of a sports
newspaper). Filtering can be seen as a case of single-label TC, that is, the
classification of incoming documents into two disjoint categories, the relevant and
the irrelevant. Additionally, a filtering system may also further classify the
documents deemed relevant to the consumer into thematic categories; in the example
above, all articles about sports should be further classified according to which
sport they deal with, so as to allow journalists specialized in individual sports to
access only documents of prospective interest for them. Similarly, an e-mail filter
might be trained to discard "junk" mail [Androutsopoulos et al. 2000; Drucker et al.
1999] and further classify nonjunk mail into topical categories of interest to the
user.

A filtering system may be installed at the producer end, in which case it must route
the documents to the interested consumers only, or at the consumer end, in which case
it must block the delivery of documents deemed uninteresting to the consumer. In the
former case, the system builds and updates a "profile" for each consumer [Liddy et
al. 1994], while in the latter case (which is the more common, and to which we will
refer in the rest of this section) a single profile is needed.

A profile may be initially specified by the user, thereby resembling a standing IR
query, and is updated by the system by using feedback information provided (either
implicitly or explicitly) by the user on the relevance or nonrelevance of the
delivered messages. In the TREC community [Lewis 1995c], this is called adaptive
filtering, while the case in which no user-specified profile is available is called
either routing or batch filtering, depending on whether documents have to be ranked
in decreasing order of estimated relevance or just accepted/rejected. Batch filtering
thus coincides with single-label TC under $|C| = 2$ categories; since this latter is
a completely general TC task, some authors [Hull 1994; Hull et al. 1996; Schapire et
al. 1998; Schütze et al. 1995], somewhat confusingly, use the term "filtering" in
place of the more appropriate term "categorization."

In information science, document filtering has a tradition dating back to the '60s,
when, addressed by systems of various degrees of automation and dealing with the
multiconsumer case discussed


above, it was called selective dissemination of information or current awareness (see
Korfhage [1997, Chapter 6]). The explosion in the availability of digital information
has boosted the importance of such systems, which are nowadays being used in contexts
such as the creation of personalized Web newspapers, junk e-mail blocking, and Usenet
news selection.

Information filtering by ML techniques is widely discussed in the literature: see
Amati and Crestani [1999], Iyer et al. [2000], Kim et al. [2000], Tauritz et al.
[2000], and Yu and Lam [1998].

3.4. Word Sense Disambiguation

Word sense disambiguation (WSD) is the activity of finding, given the occurrence in a
text of an ambiguous (i.e., polysemous or homonymous) word, the sense of this
particular word occurrence. For instance, "bank" may have (at least) two different
senses in English, as in "the Bank of England" (a financial institution) or "the bank
of river Thames" (a hydraulic engineering artifact). It is thus a WSD task to decide
which of the above senses the occurrence of "bank" in "Last week I borrowed some
money from the bank" has. WSD is very important for many applications, including
natural language processing, and indexing documents by word senses rather than by
words for IR purposes. WSD may be seen as a TC task (see Gale et al. [1993]; Escudero
et al. [2000]) once we view word occurrence contexts as documents and word senses as
categories. Quite obviously, this is a single-label TC case, and one in which
document-pivoted TC is usually the right choice.

WSD is just an example of the more general issue of resolving natural language
ambiguities, one of the most important problems in computational linguistics. Other
examples, which may all be tackled by means of TC techniques along the lines
discussed for WSD, are context-sensitive spelling correction, prepositional phrase
attachment, part of speech tagging, and word choice selection in machine translation;
see Roth [1998] for an introduction.

3.5. Hierarchical Categorization of Web Pages

TC has recently aroused a lot of interest also for its possible application to
automatically classifying Web pages, or sites, under the hierarchical catalogues
hosted by popular Internet portals. When Web documents are catalogued in this way,
rather than issuing a query to a general-purpose Web search engine a searcher may
find it easier to first navigate in the hierarchy of categories and then restrict her
search to a particular category of interest.

Classifying Web pages automatically has obvious advantages, since the manual
categorization of a large enough subset of the Web is infeasible. Unlike in the
previous applications, it is typically the case that each category must be populated
by a set of $k_1 \le x \le k_2$ documents. CPC should be chosen so as to allow new
categories to be added and obsolete ones to be deleted.

With respect to previously discussed TC applications, automatic Web page
categorization has two essential peculiarities:

(1) The hypertextual nature of the documents: Links are a rich source of information,
    as they may be understood as stating the relevance of the linked page to the
    linking page. Techniques exploiting this intuition in a TC context have been
    presented by Attardi et al. [1998], Chakrabarti et al. [1998b], Fürnkranz [1999],
    Gövert et al. [1999], and Oh et al. [2000] and experimentally compared by Yang
    et al. [2002].
(2) The hierarchical structure of the category set: This may be used, for example, by
    decomposing the classification problem into a number of smaller classification
    problems, each corresponding to a branching decision at an internal node.
    Techniques exploiting this intuition in a TC context have been presented by
    Dumais and Chen [2000], Chakrabarti et al. [1998a], Koller and Sahami [1997],
    McCallum et al. [1998], Ruiz and Srinivasan [1999], and Weigend et al. [1999].
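The decomposition of hierarchical categorization into one branching decision per
internal node can be pictured as a top-down walk over the category tree. The Python
sketch below is only an illustration under stated assumptions: the names `Node`,
`keyword_scorer`, and `classify_top_down`, the toy two-level catalogue, and the
keyword-counting scorers (which stand in for classifiers that would normally be
learned) are all hypothetical and are not taken from any of the cited works.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Node:
    """A category in the hierarchy; `scorers` maps each child name to a local scorer."""
    name: str
    children: Dict[str, "Node"] = field(default_factory=dict)
    scorers: Dict[str, Callable[[str], float]] = field(default_factory=dict)


def keyword_scorer(keywords: List[str]) -> Callable[[str], float]:
    """Toy stand-in for a learned classifier: counts keyword occurrences."""
    def score(text: str) -> float:
        tokens = text.lower().split()
        return float(sum(tokens.count(k) for k in keywords))
    return score


def classify_top_down(root: Node, text: str) -> List[str]:
    """Take the best-scoring branch at every internal node; return the path followed."""
    path, node = [root.name], root
    while node.children:
        best = max(node.children, key=lambda child: node.scorers[child](text))
        node = node.children[best]
        path.append(node.name)
    return path


# Hypothetical two-level catalogue (assumption made for the example).
sports = Node(
    "Sports",
    children={"Tennis": Node("Tennis"), "Golf": Node("Golf")},
    scorers={"Tennis": keyword_scorer(["racket", "serve"]),
             "Golf": keyword_scorer(["putt", "fairway"])},
)
root = Node(
    "Root",
    children={"Sports": sports, "Politics": Node("Politics")},
    scorers={"Sports": keyword_scorer(["match", "serve", "putt", "player"]),
             "Politics": keyword_scorer(["election", "senate"])},
)

print(classify_top_down(root, "the player lost the match on a weak serve"))
```

Each node only has to discriminate among its own children, so every local problem is
smaller than the flat problem over all leaf categories; the price, in this simple
greedy variant, is that a wrong decision near the root cannot be undone further down.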


if ((wheat & farm) or
    (wheat & commodity) or
    (bushels & export) or
    (wheat & tonnes) or
    (wheat & winter & ¬ soft)) then WHEAT else ¬ WHEAT

Fig. 1. Rule-based classifier for the WHEAT category; key words are indicated in
italic, categories are indicated in SMALL CAPS (from Apté et al. [1994]).
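As a rough illustration of how a rule of this kind can be evaluated mechanically, the
Python sketch below encodes the WHEAT rule as a disjunction of conjunctive clauses
over the set of words occurring in a document. The tokenization, the names
`WHEAT_RULE`, `tokenize`, `satisfies_dnf`, and `classify_wheat`, and the reading of
the last clause as requiring that "soft" be absent are assumptions made for the
example; it is not meant to reproduce any existing rule-based system.

```python
import re

# Each clause is (required_words, forbidden_words); the rule fires if any clause holds.
WHEAT_RULE = [
    ({"wheat", "farm"}, set()),
    ({"wheat", "commodity"}, set()),
    ({"bushels", "export"}, set()),
    ({"wheat", "tonnes"}, set()),
    ({"wheat", "winter"}, {"soft"}),  # assumed reading: "soft" must be absent
]


def tokenize(text):
    """Lowercase the text and return the set of alphabetic tokens (a simplification)."""
    return set(re.findall(r"[a-z]+", text.lower()))


def satisfies_dnf(words, rule):
    """True iff at least one clause has all its required words and no forbidden word."""
    return any(required <= words and not (forbidden & words)
               for required, forbidden in rule)


def classify_wheat(text):
    """Return True to file the document under WHEAT, False otherwise."""
    return satisfies_dnf(tokenize(text), WHEAT_RULE)


print(classify_wheat("Hard winter wheat prices rose"))
print(classify_wheat("Soft winter wheat prices rose"))
```

Representing the rule as data rather than as nested conditionals makes the cost of
maintenance visible: every change to the category requires someone to edit the clause
list by hand.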

4. THE MACHINE LEARNING APPROACH TO TEXT CATEGORIZATION

In the '80s, the most popular approach (at least in operational settings) for the
creation of automatic document classifiers consisted in manually building, by means
of knowledge engineering (KE) techniques, an expert system capable of taking TC
decisions. Such an expert system would typically consist of a set of manually defined
logical rules, one per category, of type

    if <DNF formula> then <category>.

A DNF ("disjunctive normal form") formula is a disjunction of conjunctive clauses;
the document is classified under <category> iff it satisfies the formula, that is,
iff it satisfies at least one of the clauses. The most famous example of this
approach is the CONSTRUE system [Hayes et al. 1990], built by Carnegie Group for the
Reuters news agency. A sample rule of the type used in CONSTRUE is illustrated in
Figure 1.

The drawback of this approach is the knowledge acquisition bottleneck well known from
the expert systems literature. That is, the rules must be manually defined by a
knowledge engineer with the aid of a domain expert (in this case, an expert in the
membership of documents in the chosen set of categories): if the set of categories is
updated, then these two professionals must intervene again, and if the classifier is
ported to a completely different domain (i.e., set of categories), a different domain
expert needs to intervene and the work has to be repeated from scratch.

On the other hand, it was originally suggested that this approach can give very good
effectiveness results: Hayes et al. [1990] reported a .90 "breakeven" result (see
Section 7) on a subset of the Reuters test collection, a figure that outperforms even
the best classifiers built in the late '90s by state-of-the-art ML techniques.
However, no other classifier has been tested on the same dataset as CONSTRUE, and it
is not clear whether this was a randomly chosen or a favorable subset of the entire
Reuters collection. As argued by Yang [1999], the results above do not allow us to
state that these effectiveness results may be obtained in general.

Since the early '90s, the ML approach to TC has gained popularity and has eventually
become the dominant one, at least in the research community (see Mitchell [1996] for
a comprehensive introduction to ML). In this approach, a general inductive process
(also called the learner) automatically builds a classifier for a category $c_i$ by
observing the characteristics of a set of documents manually classified under $c_i$
or $\bar{c}_i$ by a domain expert; from these characteristics, the inductive process
gleans the characteristics that a new unseen document should have in order to be
classified under $c_i$. In ML terminology, the classification problem is an activity
of supervised learning, since the learning process is "supervised" by the knowledge
of the categories and of the training instances that belong to them.[2]

[2] Within the area of content-based document management tasks, an example of an
unsupervised learning activity is document clustering (see Section 1).

The advantages of the ML approach over the KE approach are evident. The engineering
effort goes toward the construction not of a classifier, but of an automatic builder
of classifiers (the learner). This means that if a learner is (as it often is)
available off-the-shelf, all that is needed is the inductive, automatic construction
of a classifier from a set of manually classified documents. The same happens if a


Machine Learning in Automated Text Categorization 9

classifier already exists and the original set of categories is updated, or
if the classifier is ported to a completely different domain.

In the ML approach, the preclassified documents are then the key resource.
In the most favorable case, they are already available; this typically
happens for organizations which have previously carried out the same
categorization activity manually and decide to automate the process. The
less favorable case is when no manually classified documents are available;
this typically happens for organizations that start a categorization
activity and opt for an automated modality straightaway. The ML approach is
more convenient than the KE approach also in this latter case. In fact, it is
easier to manually classify a set of documents than to build and tune a set
of rules, since it is easier to characterize a concept extensionally (i.e.,
to select instances of it) than intensionally (i.e., to describe the concept
in words, or to describe a procedure for recognizing its instances).

Classifiers built by means of ML techniques nowadays achieve impressive
levels of effectiveness (see Section 7), making automatic classification a
qualitatively (and not only economically) viable alternative to manual
classification.

4.1. Training Set, Test Set, and Validation Set

The ML approach relies on the availability of an initial corpus
$\Omega = \{d_1, \ldots, d_{|\Omega|}\} \subset \mathcal{D}$ of documents
preclassified under $\mathcal{C} = \{c_1, \ldots, c_{|\mathcal{C}|}\}$. That
is, the values of the total function
$\breve{\Phi} : \mathcal{D} \times \mathcal{C} \to \{T, F\}$ are known for
every pair $\langle d_j, c_i \rangle \in \Omega \times \mathcal{C}$. A
document $d_j$ is a positive example of $c_i$ if
$\breve{\Phi}(d_j, c_i) = T$, a negative example of $c_i$ if
$\breve{\Phi}(d_j, c_i) = F$.

In research settings (and in most operational settings too), once a
classifier $\Phi$ has been built it is desirable to evaluate its
effectiveness. In this case, prior to classifier construction the initial
corpus is split into two sets, not necessarily of equal size:

- a training(-and-validation) set $TV = \{d_1, \ldots, d_{|TV|}\}$. The
  classifier $\Phi$ for categories
  $\mathcal{C} = \{c_1, \ldots, c_{|\mathcal{C}|}\}$ is inductively built by
  observing the characteristics of these documents;
- a test set $Te = \{d_{|TV|+1}, \ldots, d_{|\Omega|}\}$, used for testing
  the effectiveness of the classifiers. Each $d_j \in Te$ is fed to the
  classifier, and the classifier decisions $\Phi(d_j, c_i)$ are compared
  with the expert decisions $\breve{\Phi}(d_j, c_i)$. A measure of
  classification effectiveness is based on how often the $\Phi(d_j, c_i)$
  values match the $\breve{\Phi}(d_j, c_i)$ values.

The documents in $Te$ cannot participate in any way in the inductive
construction of the classifiers; if this condition were not satisfied, the
experimental results obtained would likely be unrealistically good, and the
evaluation would thus have no scientific character [Mitchell 1996, page
129]. In an operational setting, after evaluation has been performed one
would typically retrain the classifier on the entire initial corpus, in
order to boost effectiveness. In this case, the results of the previous
evaluation would be a pessimistic estimate of the real performance, since
the final classifier has been trained on more data than the classifier
evaluated.

This is called the train-and-test approach. An alternative is the k-fold
cross-validation approach (see Mitchell [1996], page 146), in which $k$
different classifiers $\Phi_1, \ldots, \Phi_k$ are built by partitioning the
initial corpus into $k$ disjoint sets $Te_1, \ldots, Te_k$ and then
iteratively applying the train-and-test approach on pairs
$\langle TV_i = \Omega - Te_i, Te_i \rangle$. The final effectiveness figure
is obtained by individually computing the effectiveness of
$\Phi_1, \ldots, \Phi_k$, and then averaging the individual results in some
way.

In both approaches, it is often the case that the internal parameters of the
classifiers must be tuned by testing which values of the parameters yield
the best effectiveness. In order to make this optimization possible, in the
train-and-test approach the set $\{d_1, \ldots, d_{|TV|}\}$ is further split
into a training set $Tr = \{d_1, \ldots, d_{|Tr|}\}$, from which the
classifier is built, and a validation set
$Va = \{d_{|Tr|+1}, \ldots, d_{|TV|}\}$ (sometimes called a hold-out set),
on which the repeated tests of the classifier aimed


at parameter optimization are performed; the obvious variant may be used in
the k-fold cross-validation case. Note that, for the same reason why we do
not test a classifier on the documents it has been trained on, we do not
test it on the documents it has been optimized on: test set and validation
set must be kept separate.3

Given a corpus $\Omega$, one may define the generality $g_\Omega(c_i)$ of a
category $c_i$ as the percentage of documents that belong to $c_i$, that is:

$$g_\Omega(c_i) = \frac{|\{d_j \in \Omega \mid \breve{\Phi}(d_j, c_i) = T\}|}{|\Omega|}.$$

The training set generality $g_{Tr}(c_i)$, validation set generality
$g_{Va}(c_i)$, and test set generality $g_{Te}(c_i)$ of $c_i$ may be defined
in the obvious way.

4.2. Information Retrieval Techniques and Text Categorization

Text categorization heavily relies on the basic machinery of IR. The reason
is that TC is a content-based document management task, and as such it
shares many characteristics with other IR tasks such as text search.

IR techniques are used in three phases of the text classifier life cycle:

(1) IR-style indexing is always performed on the documents of the initial
    corpus and on those to be classified during the operational phase;
(2) IR-style techniques (such as document-request matching, query
    reformulation, . . .) are often used in the inductive construction of
    the classifiers;
(3) IR-style evaluation of the effectiveness of the classifiers is
    performed.

The various approaches to classification differ mostly in how they tackle
(2), although in a few cases nonstandard approaches to (1) and (3) are also
used. Indexing, induction, and evaluation are the themes of Sections 5, 6
and 7, respectively.

5. DOCUMENT INDEXING AND DIMENSIONALITY REDUCTION

5.1. Document Indexing

Texts cannot be directly interpreted by a classifier or by a
classifier-building algorithm. Because of this, an indexing procedure that
maps a text $d_j$ into a compact representation of its content needs to be
uniformly applied to training, validation, and test documents. The choice of
a representation for text depends on what one regards as the meaningful
units of text (the problem of lexical semantics) and the meaningful natural
language rules for the combination of these units (the problem of
compositional semantics). Similarly to what happens in IR, in TC this latter
problem is usually disregarded,4 and a text $d_j$ is usually represented as
a vector of term weights
$\vec{d}_j = \langle w_{1j}, \ldots, w_{|\mathcal{T}|j} \rangle$, where
$\mathcal{T}$ is the set of terms (sometimes called features) that occur at
least once in at least one document of $Tr$, and $0 \le w_{kj} \le 1$
represents, loosely speaking, how much term $t_k$ contributes to the
semantics of document $d_j$. Differences among approaches are accounted for
by

(1) different ways to understand what a term is;
(2) different ways to compute term weights.

A typical choice for (1) is to identify terms with words. This is often
called either the set of words or the bag of words approach to document
representation, depending on whether weights are binary or not.

In a number of experiments [Apté et al. 1994; Dumais et al. 1998; Lewis
1992a], it has been found that representations more sophisticated than this
do not yield significantly better effectiveness, thereby confirming similar
results from IR

3 From now on, we will take the freedom to use the expression test document
to denote any document not in the training set and validation set. This
includes thus any document submitted to the classifier in the operational
phase.

4 An exception to this is represented by learning approaches based on hidden
Markov models [Denoyer et al. 2001; Frasconi et al. 2002].
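
As an illustration of the corpus handling described in Section 4.1, the following is a minimal Python sketch of the train-and-test split, the further split of $TV$ into $Tr$ and $Va$, the k-fold pairs $\langle TV_i, Te_i \rangle$, and the generality of a category. The function names, the default fractions, the shuffling step, and the `is_positive` callback (standing in for the expert decisions on one category) are assumptions made for the example, not part of the survey.

```python
import random


def train_test_split(corpus, test_fraction=0.3, seed=0):
    """Split the initial corpus into two disjoint lists (TV, Te)."""
    docs = list(corpus)
    random.Random(seed).shuffle(docs)  # reproducible shuffle of a copy
    n_test = int(round(len(docs) * test_fraction))
    return docs[n_test:], docs[:n_test]


def train_validation_split(tv, validation_fraction=0.2):
    """Split TV further into (Tr, Va); Va is used only for parameter tuning."""
    n_val = int(round(len(tv) * validation_fraction))
    cut = len(tv) - n_val
    return tv[:cut], tv[cut:]


def k_fold_pairs(corpus, k):
    """Yield k pairs (TV_i, Te_i); the Te_i are disjoint and cover the corpus."""
    docs = list(corpus)
    folds = [docs[i::k] for i in range(k)]
    for i, te in enumerate(folds):
        tv = [d for j, fold in enumerate(folds) if j != i for d in fold]
        yield tv, te


def generality(docs, is_positive):
    """Fraction of docs that are positive examples of one fixed category."""
    docs = list(docs)
    return sum(1 for d in docs if is_positive(d)) / len(docs)
```

Keeping the three splits as separate return values makes it harder to evaluate on documents that were used for training or tuning by accident, which is the condition the text insists on.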


[Salton and Buckley 1988]. In particular, some authors have used phrases,
rather than individual words, as indexing terms [Fuhr et al. 1991; Schütze
et al. 1995; Tzeras and Hartmann 1993], but the experimental results found
to date have not been uniformly encouraging, irrespective of whether the
notion of phrase is motivated

- syntactically, that is, the phrase is such according to a grammar of the
  language (see Lewis [1992a]); or
- statistically, that is, the phrase is not grammatically such, but is
  composed of a set/sequence of words whose patterns of contiguous
  occurrence in the collection are statistically significant (see Caropreso
  et al. [2001]).

Lewis [1992a] argued that the likely reason for the discouraging results is
that, although indexing languages based on phrases have superior semantic
qualities, they have inferior statistical qualities with respect to
word-only indexing languages: a phrase-only indexing language has more
terms, more synonymous or nearly synonymous terms, lower consistency of
assignment (since synonymous terms are not assigned to the same documents),
and lower document frequency for terms [Lewis 1992a, page 40]. Although his
remarks are about syntactically motivated phrases, they also apply to
statistically motivated ones, although perhaps to a smaller degree. A
combination of the two approaches is probably the best way to go: Tzeras and
Hartmann [1993] obtained significant improvements by using noun phrases
obtained through a combination of syntactic and statistical criteria, where
a crude syntactic method was complemented by a statistical filter (only
those syntactic phrases that occurred at least three times in the positive
examples of a category $c_i$ were retained). It is likely that the final
word on the usefulness of phrase indexing in TC has still to be told, and
investigations in this direction are still being actively pursued [Caropreso
et al. 2001; Mladenić and Grobelnik 1998].

As for issue (2), weights usually range between 0 and 1 (an exception is
Lewis et al. [1996]), and for ease of exposition we will assume they always
do. As a special case, binary weights may be used (1 denoting presence and 0
absence of the term in the document); whether binary or nonbinary weights
are used depends on the classifier learning algorithm used. In the case of
nonbinary indexing, for determining the weight $w_{kj}$ of term $t_k$ in
document $d_j$ any IR-style indexing technique that represents a document as
a vector of weighted terms may be used. Most of the time, the standard tfidf
function is used (see Salton and Buckley [1988]), defined as

$$\mathrm{tfidf}(t_k, d_j) = \#(t_k, d_j) \cdot \log \frac{|Tr|}{\#_{Tr}(t_k)}, \qquad (1)$$

where $\#(t_k, d_j)$ denotes the number of times $t_k$ occurs in $d_j$, and
$\#_{Tr}(t_k)$ denotes the document frequency of term $t_k$, that is, the
number of documents in $Tr$ in which $t_k$ occurs. This function embodies
the intuitions that (i) the more often a term occurs in a document, the more
it is representative of its content, and (ii) the more documents a term
occurs in, the less discriminating it is.5 Note that this formula (as most
other indexing formulae) weights the importance of a term to a document in
terms of occurrence considerations only, thereby deeming of null importance
the order in which the terms occur in the document and the syntactic role
they play. In other words, the semantics of a document is reduced to the
collective lexical semantics of the terms that occur in it, thereby
disregarding the issue of compositional semantics (exceptions are the
representation techniques used for FOIL [Cohen 1995a] and SLEEPING EXPERTS
[Cohen and Singer 1999]).

In order for the weights to fall in the [0,1] interval and for the documents
to be represented by vectors of equal length, the weights resulting from
tfidf are often

5 There exist many variants of tfidf, which differ from each other in terms
of logarithms, normalization or other correction factors. Formula 1 is just
one of the possible instances of this class; see Salton and Buckley [1988]
and Singhal et al. [1996] for variations on this theme.
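
To make Formula 1 concrete, here is a small Python sketch that computes document frequencies over a training set and the resulting tfidf weight of a term in a document. It assumes documents are already tokenized into lists of terms and uses the natural logarithm (the survey does not fix the base); the function names are illustrative only.

```python
import math
from collections import Counter


def document_frequencies(training_docs):
    """Map each term to the number of training documents in which it occurs."""
    df = Counter()
    for tokens in training_docs:
        df.update(set(tokens))  # count each term at most once per document
    return df


def tfidf(term, doc_tokens, df, n_train):
    """Formula 1: #(t_k, d_j) * log(|Tr| / #_Tr(t_k)), natural log assumed."""
    tf = doc_tokens.count(term)
    if tf == 0 or df.get(term, 0) == 0:
        return 0.0  # absent term, or term never seen in Tr (assumed convention)
    return tf * math.log(n_train / df[term])


def index_document(doc_tokens, terms, df, n_train):
    """Unnormalized weight vector of a document over an ordered term list."""
    return [tfidf(t, doc_tokens, df, n_train) for t in terms]
```

A term that occurs in every training document gets weight 0 under this formula, which matches intuition (ii): it does not discriminate between documents at all.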


normalized by cosine normalization, given by

$$w_{kj} = \frac{\mathrm{tfidf}(t_k, d_j)}{\sqrt{\sum_{s=1}^{|\mathcal{T}|} (\mathrm{tfidf}(t_s, d_j))^2}}. \qquad (2)$$

Although normalized tfidf is the most popular one, other indexing functions
have also been used, including probabilistic techniques [Gövert et al. 1999]
or techniques for indexing structured documents [Larkey and Croft 1996].
Functions different from tfidf are especially needed when $Tr$ is not
available in its entirety from the start and $\#_{Tr}(t_k)$ cannot thus be
computed, as in adaptive filtering; in this case, approximations of tfidf
are usually employed [Dagan et al. 1997, Section 4.3].

Before indexing, the removal of function words (i.e., topic-neutral words
such as articles, prepositions, conjunctions, etc.) is almost always
performed (exceptions include Lewis et al. [1996], Nigam et al. [2000], and
Riloff [1995]).6 Concerning stemming (i.e., grouping words that share the
same morphological root), its suitability to TC is controversial. Although,
similarly to unsupervised term clustering (see Section 5.5.1) of which it is
an instance, stemming has sometimes been reported to hurt effectiveness
(e.g., Baker and McCallum [1998]), the recent tendency is to adopt it, as it
reduces both the dimensionality of the term space (see Section 5.3) and the
stochastic dependence between terms (see Section 6.2).

Depending on the application, either the full text of the document or
selected parts of it are indexed. While the former option is the rule,
exceptions exist. For instance, in a patent categorization application
Larkey [1999] indexed only the title, the abstract, the first 20 lines of
the summary, and the section containing the claims of novelty of the
described invention. This approach was made possible by the fact that
documents describing patents are structured. Similarly, when a document
title is available, one can give extra importance to the words it contains
[Apté et al. 1994; Cohen and Singer 1999; Weiss et al. 1999]. When documents
are flat, identifying the most relevant part of a document is instead a
nonobvious task.

5.2. The Darmstadt Indexing Approach

The AIR/X system [Fuhr et al. 1991] occupies a special place in the
literature on indexing for TC. This system is the final result of the AIR
project, one of the most important efforts in the history of TC: spanning a
duration of more than 10 years [Knorz 1982; Tzeras and Hartmann 1993], it
has produced a system operatively employed since 1985 in the classification
of corpora of scientific literature of $O(10^5)$ documents and $O(10^4)$
categories, and has had important theoretical spin-offs in the field of
probabilistic indexing [Fuhr 1989; Fuhr and Buckley 1991].7

The approach to indexing taken in AIR/X is known as the Darmstadt Indexing
Approach (DIA) [Fuhr 1985]. Here, indexing is used in the sense of Section
3.1, that is, as using terms from a controlled vocabulary, and is thus a
synonym of TC (the DIA was later extended to indexing with free terms [Fuhr
and Buckley 1991]). The idea that underlies the DIA is the use of a much
wider set of features than described in Section 5.1. All other approaches
mentioned in this paper view terms as the dimensions of the learning space,
where terms may be single words, stems, phrases, or (see Sections 5.5.1 and
5.5.2) combinations of any of these. In contrast, the DIA considers
properties (of terms, documents,

6 One application of TC in which it would be inappropriate to remove
function words is author identification for documents of disputed paternity.
In fact, as noted in Manning and Schütze [1999], page 589, it is often the
little words that give an author away (for example, the relative frequencies
of words like because or though).

7 The AIR/X system, its applications (including the AIR/PHYS system
[Biebricher et al. 1988], an application of AIR/X to indexing physics
literature), and its experiments have also been richly documented in a
series of papers and doctoral theses written in German. The interested
reader may consult Fuhr et al. [1991] for a detailed bibliography.
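
The cosine normalization of Formula 2 can be sketched in a few lines of Python. The sketch assumes the unnormalized tfidf weights of one document are held in a dictionary keyed by term; the function name and the handling of an all-zero vector (returned unchanged as zeros) are choices made for this example.

```python
import math


def cosine_normalize(weights):
    """Formula 2: divide each weight by the Euclidean norm of the vector."""
    norm = math.sqrt(sum(w * w for w in weights.values()))
    if norm == 0.0:
        return {t: 0.0 for t in weights}  # empty or all-zero document vector
    return {t: w / norm for t, w in weights.items()}
```

Since tfidf weights are nonnegative, every normalized weight falls in the [0,1] interval and every nonzero document vector ends up with unit Euclidean length.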


categories, or pairwise relationships among these) as basic dimensions of
the learning space. Examples of these are

- properties of a term $t_k$: for example, the idf of $t_k$;
- properties of the relationship between a term $t_k$ and a document $d_j$:
  for example, the tf of $t_k$ in $d_j$; or the location (e.g., in the
  title, or in the abstract) of $t_k$ within $d_j$;
- properties of a document $d_j$: for example, the length of $d_j$;
- properties of a category $c_i$: for example, the training set generality
  of $c_i$.

For each possible document-category pair, the values of these features are
collected in a so-called relevance description vector $\vec{rd}(d_j, c_i)$.
The size of this vector is determined by the number of properties
considered, and is thus independent of specific terms, categories, or
documents (for multivalued features, appropriate aggregation functions are
applied in order to yield a single value to be included in
$\vec{rd}(d_j, c_i)$); in this way an abstraction from specific terms,
categories, or documents is achieved.

The main advantage of this approach is the possibility to consider
additional features that can hardly be accounted for in the usual term-based
approaches, for example, the location of a term within a document, or the
certainty with which a phrase was identified in a document. The
term-category relationship is described by estimates, derived from the
training set, of the probability $P(c_i \mid t_k)$ that a document belongs
to category $c_i$, given that it contains term $t_k$ (the DIA association
factor).8 Relevance description vectors $\vec{rd}(d_j, c_i)$ are then the
final representations that are used for the classification of document $d_j$
under category $c_i$.

The essential ideas of the DIA (transforming the classification space by
means of abstraction and using a more detailed text representation than the
standard bag-of-words approach) have not been taken up by other researchers
so far. For new TC applications dealing with structured documents or
categorization of Web pages, these ideas may become of increasing
importance.

5.3. Dimensionality Reduction

Unlike in text retrieval, in TC the high dimensionality of the term space
(i.e., the large value of $|\mathcal{T}|$) may be problematic. In fact,
while typical algorithms used in text retrieval (such as cosine matching)
can scale to high values of $|\mathcal{T}|$, the same does not hold for many
sophisticated learning algorithms used for classifier induction (e.g., the
LLSF algorithm of Yang and Chute [1994]). Because of this, before classifier
induction one often applies a pass of dimensionality reduction (DR), whose
effect is to reduce the size of the vector space from $|\mathcal{T}|$ to
$|\mathcal{T}'| \ll |\mathcal{T}|$; the set $\mathcal{T}'$ is called the
reduced term set.

DR is also beneficial since it tends to reduce overfitting, that is, the
phenomenon by which a classifier is tuned also to the contingent
characteristics of the training data rather than just the constitutive
characteristics of the categories. Classifiers that overfit the training
data are good at reclassifying the data they have been trained on, but much
worse at classifying previously unseen data. Experiments have shown that, in
order to avoid overfitting, a number of training examples roughly
proportional to the number of terms used is needed; Fuhr and Buckley [1991,
page 235] have suggested that 50-100 training examples per term may be
needed in TC tasks. This means that, if DR is performed, overfitting may be
avoided even if a smaller number of training examples is used. However, in
removing terms the risk is to remove potentially useful information on the
meaning of the documents. It is then clear that, in order to obtain optimal
(cost-)effectiveness, the reduction process must be performed with care.
Various DR methods have been proposed, either from the information theory or
from the linear algebra literature, and their relative merits have been
tested by experimentally evaluating the variation

8 Association factors are called adhesion coefficients in many early papers
on TC; see Field [1975]; Robertson and Harding [1984].
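
The idea of a relevance description vector whose length depends only on the number of properties considered can be illustrated with a short Python sketch. This is not the AIR/X feature set: the seven properties chosen, the use of max, count, and mean as aggregation functions, and all names (`doc_freq`, `assoc_factor`, and so on) are assumptions made for the example.

```python
import math


def relevance_description(doc_tokens, title_tokens, doc_freq, n_train,
                          assoc_factor, category_generality):
    """Fixed-length feature vector for one (document, category) pair.

    doc_freq maps term -> document frequency in the training set;
    assoc_factor maps term -> estimated P(c_i | t_k) for the category at hand.
    """
    terms = sorted(set(doc_tokens))
    title = set(title_tokens)
    idf = [math.log(n_train / doc_freq[t]) for t in terms if doc_freq.get(t, 0) > 0]
    tf = [doc_tokens.count(t) for t in terms]
    assoc = [assoc_factor.get(t, 0.0) for t in terms]
    return [
        max(idf, default=0.0),                      # term property (aggregated)
        max(tf, default=0),                         # term-document property
        sum(1 for t in terms if t in title),        # location: terms in the title
        len(doc_tokens),                            # document property
        category_generality,                        # category property
        max(assoc, default=0.0),                    # term-category property
        sum(assoc) / len(assoc) if assoc else 0.0,  # same, mean-aggregated
    ]
```

Whatever the document or category, the vector has seven components, so a single learner can be trained on pairs drawn from all categories.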


in effectiveness that a given classifier undergoes after application of the
function to the term space.

There are two distinct ways of viewing DR, depending on whether the task is
performed locally (i.e., for each individual category) or globally:

- local DR: for each category $c_i$, a set $\mathcal{T}'_i$ of terms, with
  $|\mathcal{T}'_i| \ll |\mathcal{T}|$, is chosen for classification under
  $c_i$ (see Apté et al. [1994]; Lewis and Ringuette [1994]; Li and Jain
  [1998]; Ng et al. [1997]; Sable and Hatzivassiloglou [2000]; Schütze et
  al. [1995]; Wiener et al. [1995]). This means that different subsets of
  $\vec{d}_j$ are used when working with the different categories. Typical
  values are $10 \le |\mathcal{T}'_i| \le 50$.
- global DR: a set $\mathcal{T}'$ of terms, with
  $|\mathcal{T}'| \ll |\mathcal{T}|$, is chosen for the classification under
  all categories $\mathcal{C} = \{c_1, \ldots, c_{|\mathcal{C}|}\}$ (see
  Caropreso et al. [2001]; Mladenić [1998]; Yang [1999]; Yang and Pedersen
  [1997]).

This distinction usually does not impact on the choice of DR technique,
since most such techniques can be used (and have been used) for local and
global DR alike (supervised DR techniques, see Section 5.5.1, are exceptions
to this rule). In the rest of this section, we will assume that the global
approach is used, although everything we will say also applies to the local
approach.

A second, orthogonal distinction may be drawn in terms of the nature of the
resulting terms:

- DR by term selection: $\mathcal{T}'$ is a subset of $\mathcal{T}$;
- DR by term extraction: the terms in $\mathcal{T}'$ are not of the same
  type as the terms in $\mathcal{T}$ (e.g., if the terms in $\mathcal{T}$
  are words, the terms in $\mathcal{T}'$ may not be words at all), but are
  obtained by combinations or transformations of the original ones.

Unlike in the previous distinction, these two ways of doing DR are tackled
by very different techniques; we will address them separately in the next
two sections.

5.4. Dimensionality Reduction by Term Selection

Given a predetermined integer $r$, techniques for term selection (also
called term space reduction, TSR) attempt to select, from the original set
$\mathcal{T}$, the set $\mathcal{T}'$ of terms (with
$|\mathcal{T}'| \ll |\mathcal{T}|$) that, when used for document indexing,
yields the highest effectiveness. Yang and Pedersen [1997] have shown that
TSR may even result in a moderate ($\le 5\%$) increase in effectiveness,
depending on the classifier, on the aggressivity
$|\mathcal{T}| / |\mathcal{T}'|$ of the reduction, and on the TSR technique
used.

Moulinier et al. [1996] have used a so-called wrapper approach, that is, one
in which $\mathcal{T}'$ is identified by means of the same learning method
that will be used for building the classifier [John et al. 1994]. Starting
from an initial term set, a new term set is generated by either adding or
removing a term. When a new term set is generated, a classifier based on it
is built and then tested on a validation set. The term set that results in
the best effectiveness is chosen. This approach has the advantage of being
tuned to the learning algorithm being used; moreover, if local DR is
performed, different numbers of terms for different categories may be
chosen, depending on whether a category is or is not easily separable from
the others. However, the sheer size of the space of different term sets
makes it cost-prohibitive for standard TC applications.

A computationally easier alternative is the filtering approach [John et al.
1994], that is, keeping the $|\mathcal{T}'| \ll |\mathcal{T}|$ terms that
receive the highest score according to a function that measures the
importance of the term for the TC task. We will explore this solution in the
rest of this section.

5.4.1. Document Frequency. A simple and effective global TSR function is the
document frequency $\#_{Tr}(t_k)$ of a term $t_k$, that is, only the terms
that occur in the highest number of documents are retained. In a series of
experiments Yang and Pedersen [1997] have shown that with $\#_{Tr}(t_k)$ it
is possible to reduce the dimensionality by a factor of 10 with no loss in
effectiveness (a


reduction by a factor of 100 bringing about just a small loss).

This seems to indicate that the terms occurring most frequently in the
collection are the most valuable for TC. As such, this would seem to
contradict a well-known law of IR, according to which the terms with
low-to-medium document frequency are the most informative ones [Salton and
Buckley 1988]. But these two results do not contradict each other, since it
is well known (see Salton et al. [1975]) that the large majority of the
words occurring in a corpus have a very low document frequency; this means
that by reducing the term set by a factor of 10 using document frequency,
only such words are removed, while the words from low-to-medium to high
document frequency are preserved. Of course, stop words need to be removed
in advance, lest only topic-neutral words are retained [Mladenić 1998].

Finally, note that a slightly more empirical form of TSR by document
frequency is adopted by many authors, who remove all terms occurring in at
most $x$ training documents (popular values for $x$ range from 1 to 3),
either as the only form of DR [Maron 1961; Ittner et al. 1995] or before
applying another more sophisticated form [Dumais et al. 1998; Li and Jain
1998]. A variant of this policy is removing all terms that occur at most $x$
times in the training set (e.g., Dagan et al. [1997]; Joachims [1997]), with
popular values for $x$ ranging from 1 (e.g., Baker and McCallum [1998]) to 5
(e.g., Apté et al. [1994]; Cohen [1995a]).

5.4.2. Other Information-Theoretic Term Selection Functions. Other more
sophisticated information-theoretic functions have been used in the
literature, among them the DIA association factor [Fuhr et al. 1991],
chi-square [Caropreso et al. 2001; Galavotti et al. 2000; Schütze et al.
1995; Sebastiani et al. 2000; Yang and Pedersen 1997; Yang and Liu 1999],
NGL coefficient [Ng et al. 1997; Ruiz and Srinivasan 1999], information gain
[Caropreso et al. 2001; Larkey 1998; Lewis 1992a; Lewis and Ringuette 1994;
Mladenić 1998; Moulinier and Ganascia 1996; Yang and Pedersen 1997; Yang and
Liu 1999], mutual information [Dumais et al. 1998; Lam et al. 1997; Larkey
and Croft 1996; Lewis and Ringuette 1994; Li and Jain 1998; Moulinier et al.
1996; Ruiz and Srinivasan 1999; Taira and Haruno 1999; Yang and Pedersen
1997], odds ratio [Caropreso et al. 2001; Mladenić 1998; Ruiz and Srinivasan
1999], relevancy score [Wiener et al. 1995], and GSS coefficient [Galavotti
et al. 2000]. The mathematical definitions of these measures are summarized
for convenience in Table I.9 Here, probabilities are interpreted on an event
space of documents (e.g., $P(\bar{t}_k, c_i)$ denotes the probability that,
for a random document $x$, term $t_k$ does not occur in $x$ and $x$ belongs
to category $c_i$), and are estimated by counting occurrences in the
training set. All functions are specified locally to a specific category
$c_i$; in order to assess the value of a term $t_k$ in a global,
category-independent sense, either the sum
$f_{sum}(t_k) = \sum_{i=1}^{|\mathcal{C}|} f(t_k, c_i)$, or the weighted sum
$f_{wsum}(t_k) = \sum_{i=1}^{|\mathcal{C}|} P(c_i) f(t_k, c_i)$, or the
maximum $f_{max}(t_k) = \max_{i=1}^{|\mathcal{C}|} f(t_k, c_i)$ of their
category-specific values $f(t_k, c_i)$ are usually computed.

These functions try to capture the intuition that the best terms for $c_i$
are the ones distributed most differently in the sets of positive and
negative examples of $c_i$. However, interpretations of this principle vary
across different functions. For instance, in the experimental sciences
$\chi^2$ is used to measure how the results of an observation differ (i.e.,
are independent) from the results expected according to an initial
hypothesis (lower values indicate lower dependence). In DR we measure how
independent $t_k$ and $c_i$ are. The terms $t_k$

9 For better uniformity Table I views all the TSR functions of this section
in terms of subjective probability. In some cases such as
$\chi^2(t_k, c_i)$ this is slightly artificial, since this function is not
usually viewed in probabilistic terms. The formulae refer to the local
(i.e., category-specific) forms of the functions, which again is slightly
artificial in some cases. Note that the NGL and GSS coefficients are here
named after their authors, since they had originally been given names that
might generate some confusion if used here.
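
The filtering approach of Section 5.4 (score every term, keep the best ones) can be sketched in Python as follows, with document frequency as the scoring function and with the three ways of turning category-specific scores $f(t_k, c_i)$ into a global score. Documents are assumed to be lists of tokens; the function names, the `how` parameter, and the alphabetical tie-breaking are assumptions made for the example.

```python
from collections import Counter


def document_frequency(training_docs):
    """#_Tr(t_k): number of training documents in which each term occurs."""
    df = Counter()
    for tokens in training_docs:
        df.update(set(tokens))
    return df


def globalize(local_scores, priors, how="max"):
    """Combine {category: f(t_k, c_i)} into f_sum, f_wsum, or f_max."""
    if how == "sum":
        return sum(local_scores.values())
    if how == "wsum":
        return sum(priors[c] * s for c, s in local_scores.items())
    if how == "max":
        return max(local_scores.values())
    raise ValueError("how must be 'sum', 'wsum' or 'max'")


def keep_top_terms(scores, n_keep):
    """Filtering: retain the n_keep highest-scoring terms (ties: alphabetical)."""
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return set(ranked[:n_keep])


def drop_rare_terms(df, x):
    """Empirical variant: remove all terms occurring in at most x documents."""
    return {t for t, n in df.items() if n > x}
```

For instance, `keep_top_terms(document_frequency(docs), r)` performs global selection by document frequency, while passing globalized chi-square or information gain scores to the same function gives the more sophisticated variants.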


with the lowest value for $\chi^2(t_k, c_i)$ are thus the most independent
from $c_i$; since we are interested in the terms which are not, we select
the terms for which $\chi^2(t_k, c_i)$ is highest.

While each TSR function has its own rationale, the ultimate word on its
value is the effectiveness it brings about. Various experimental comparisons
of TSR functions have thus been carried out [Caropreso et al. 2001;
Galavotti et al. 2000; Mladenić 1998; Yang and Pedersen 1997]. In these
experiments most functions listed in Table I (with the possible exception of
MI) have improved on the results of document frequency. For instance, Yang
and Pedersen [1997] have shown that, with various classifiers and various
initial corpora, sophisticated techniques such as $IG_{sum}(t_k, c_i)$ or
$\chi^2_{max}(t_k, c_i)$ can reduce the dimensionality of the term space by
a factor of 100 with no loss (or even with a small increase) of
effectiveness. Collectively, the experiments reported in the above-mentioned
papers seem to indicate that
$\{OR_{sum}, NGL_{sum}, GSS_{max}\} > \{\chi^2_{max}, IG_{sum}\} > \{\chi^2_{wavg}\} \gg \{MI_{max}, MI_{wsum}\}$,
where > means performs better than. However, it should be noted that these
results are just indicative, and that more general statements on the
relative merits of these functions could be made only as a result of
comparative experiments performed in thoroughly controlled conditions and on
a variety of different situations (e.g., different classifiers, different
initial corpora, . . . ).

5.5. Dimensionality Reduction by Term Extraction

Given a predetermined $|\mathcal{T}'| \ll |\mathcal{T}|$, term extraction
attempts to generate, from the original set $\mathcal{T}$, a set
$\mathcal{T}'$ of synthetic terms that maximize effectiveness. The rationale
for using synthetic (rather than naturally occurring) terms is that, due to
the pervasive problems of polysemy, homonymy, and synonymy, the original
terms may not be optimal dimensions for document content representation.
Methods for term extraction try to solve these problems by creating
artificial terms that do not suffer from them. Any term extraction method
consists in (i) a method for extracting the new terms from the

Table I. Main Functions Used for Term Space Reduction Purposes. (Information
gain is also known as expected mutual information, and is used under this
name by Lewis [1992a, page 44] and Larkey [1998]. In the $RS(t_k, c_i)$
formula, $d$ is a constant damping factor.)

Function                 Denoted by           Mathematical form

DIA association factor   $z(t_k, c_i)$        $P(c_i \mid t_k)$

Information gain         $IG(t_k, c_i)$       $\sum_{c \in \{c_i, \bar{c}_i\}} \sum_{t \in \{t_k, \bar{t}_k\}} P(t, c) \cdot \log \frac{P(t, c)}{P(t) \cdot P(c)}$

Mutual information       $MI(t_k, c_i)$       $\log \frac{P(t_k, c_i)}{P(t_k) \cdot P(c_i)}$

Chi-square               $\chi^2(t_k, c_i)$   $\frac{|Tr| \cdot [P(t_k, c_i) \cdot P(\bar{t}_k, \bar{c}_i) - P(t_k, \bar{c}_i) \cdot P(\bar{t}_k, c_i)]^2}{P(t_k) \cdot P(\bar{t}_k) \cdot P(c_i) \cdot P(\bar{c}_i)}$

NGL coefficient          $NGL(t_k, c_i)$      $\frac{\sqrt{|Tr|} \cdot [P(t_k, c_i) \cdot P(\bar{t}_k, \bar{c}_i) - P(t_k, \bar{c}_i) \cdot P(\bar{t}_k, c_i)]}{\sqrt{P(t_k) \cdot P(\bar{t}_k) \cdot P(c_i) \cdot P(\bar{c}_i)}}$

Relevancy score          $RS(t_k, c_i)$       $\log \frac{P(t_k \mid c_i) + d}{P(\bar{t}_k \mid \bar{c}_i) + d}$

Odds ratio               $OR(t_k, c_i)$       $\frac{P(t_k \mid c_i) \cdot (1 - P(t_k \mid \bar{c}_i))}{(1 - P(t_k \mid c_i)) \cdot P(t_k \mid \bar{c}_i)}$

GSS coefficient          $GSS(t_k, c_i)$      $P(t_k, c_i) \cdot P(\bar{t}_k, \bar{c}_i) - P(t_k, \bar{c}_i) \cdot P(\bar{t}_k, c_i)$
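
As a worked illustration of how the probabilities in the term selection functions are estimated by counting, the following Python sketch computes the GSS coefficient, chi-square, and mutual information for one term-category pair from four counts over the training set. The standard definitions of these three functions are used; the argument names (`n_tc`, `n_t`, `n_c`, `n`) are assumptions, and the sketch presumes that the term occurs in some but not all training documents and that the category has both positive and negative examples, so that no denominator is zero.

```python
import math


def joint_probabilities(n_tc, n_t, n_c, n):
    """Estimate the four joint probabilities from document counts.

    n_tc: training documents that contain the term and belong to the category
    n_t:  training documents that contain the term
    n_c:  training documents that belong to the category
    n:    |Tr|, the size of the training set
    """
    p_t_c = n_tc / n
    p_t_notc = (n_t - n_tc) / n
    p_nott_c = (n_c - n_tc) / n
    p_nott_notc = (n - n_t - n_c + n_tc) / n
    return p_t_c, p_t_notc, p_nott_c, p_nott_notc


def gss(n_tc, n_t, n_c, n):
    """GSS coefficient: P(t,c) P(not t, not c) - P(t, not c) P(not t, c)."""
    p_t_c, p_t_notc, p_nott_c, p_nott_notc = joint_probabilities(n_tc, n_t, n_c, n)
    return p_t_c * p_nott_notc - p_t_notc * p_nott_c


def chi_square(n_tc, n_t, n_c, n):
    """Chi-square: |Tr| * GSS^2 / (P(t) P(not t) P(c) P(not c))."""
    p_t, p_c = n_t / n, n_c / n
    return n * gss(n_tc, n_t, n_c, n) ** 2 / (p_t * (1 - p_t) * p_c * (1 - p_c))


def mutual_information(n_tc, n_t, n_c, n):
    """Mutual information: log(P(t,c) / (P(t) P(c))); requires n_tc > 0."""
    return math.log((n_tc / n) / ((n_t / n) * (n_c / n)))
```

With 100 training documents, a term occurring in 40 of them, a category containing 50, and 30 documents in both, the GSS coefficient is 0.1 and chi-square is about 16.67; when the term and the category are independent (20 documents in both), both values are 0 up to rounding.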


old ones, and (ii) a method for converting the original document representations into new representations based on the newly synthesized dimensions. Two term extraction methods have been experimented with in TC, namely term clustering and latent semantic indexing.

5.5.1. Term Clustering. Term clustering tries to group words with a high degree of pairwise semantic relatedness, so that the groups (or their centroids, or a representative of them) may be used instead of the terms as dimensions of the vector space. Term clustering is different from term selection, since the former tends to address terms synonymous (or near-synonymous) with other terms, while the latter targets noninformative terms.$^{10}$

$^{10}$ Some term selection methods, such as wrapper methods, also address the problem of redundancy.

Lewis [1992a] was the first to investigate the use of term clustering in TC. The method he employed, called reciprocal nearest neighbor clustering, consists in creating clusters of two terms, each of which is the most similar to the other according to some measure of similarity. His results were inferior to those obtained by single-word indexing, possibly due to a disappointing performance by the clustering method: as Lewis [1992a, page 48] said, "The relationships captured in the clusters are mostly accidental, rather than the systematic relationships that were hoped for."

Li and Jain [1998] viewed semantic relatedness between words in terms of their co-occurrence and co-absence within training documents. By using this technique in the context of a hierarchical clustering algorithm, they witnessed only a marginal effectiveness improvement; however, the small size of their experiment (see Section 6.11) hardly allows any definitive conclusion to be reached.

Both Lewis [1992a] and Li and Jain [1998] are examples of unsupervised clustering, since clustering is not affected by the category labels attached to the documents. Baker and McCallum [1998] provided instead an example of supervised clustering, as the distributional clustering method they employed clusters together those terms that tend to indicate the presence of the same category, or group of categories. Their experiments, carried out in the context of a Naïve Bayes classifier (see Section 6.2), showed only a 2% effectiveness loss with an aggressivity of 1,000, and even showed some effectiveness improvement with less aggressive levels of reduction. Later experiments by Slonim and Tishby [2001] have confirmed the potential of supervised clustering methods for term extraction.

5.5.2. Latent Semantic Indexing. Latent semantic indexing (LSI [Deerwester et al. 1990]) is a DR technique developed in IR in order to address the problems deriving from the use of synonymous, near-synonymous, and polysemous words as dimensions of document representations. This technique compresses document vectors into vectors of a lower-dimensional space whose dimensions are obtained as combinations of the original dimensions by looking at their patterns of co-occurrence. In practice, LSI infers the dependence among the original terms from a corpus and "wires" this dependence into the newly obtained, independent dimensions. The function mapping original vectors into new vectors is obtained by applying a singular value decomposition to the matrix formed by the original document vectors. In TC this technique is applied by deriving the mapping function from the training set and then applying it to training and test documents alike.

One characteristic of LSI is that the newly obtained dimensions are not, unlike in term selection and term clustering, intuitively interpretable. However, they work well in bringing out the "latent" semantic structure of the vocabulary used in the corpus. For instance, Schütze et al. [1995, page 235] discussed the classification under category "Demographic shifts in the U.S. with economic impact" of a document that was indeed a positive


test instance for the category, and that contained, among others, the quite revealing sentence "The nation grew to 249.6 million people in the 1980s as more Americans left the industrial and agricultural heartlands for the South and West." The classifier decision was incorrect when local DR had been performed by $\chi^2$-based term selection retaining the top original 200 terms, but was correct when the same task was tackled by means of LSI. This well exemplifies how LSI works: the above sentence does not contain any of the 200 terms most relevant to the category selected by $\chi^2$, but quite possibly the words contained in it have concurred to produce one or more of the LSI higher-order terms that generate the document space of the category. As Schütze et al. [1995, page 230] put it, "if there is a great number of terms which all contribute a small amount of critical information, then the combination of evidence is a major problem for a term-based classifier." A drawback of LSI, though, is that if some original term is particularly good in itself at discriminating a category, that discrimination power may be lost in the new vector space.

Wiener et al. [1995] used LSI in two alternative ways: (i) for local DR, thus creating several category-specific LSI representations, and (ii) for global DR, thus creating a single LSI representation for the entire category set. Their experiments showed the former approach to perform better than the latter, and both approaches to perform better than simple TSR based on Relevancy Score (see Table I).

Schütze et al. [1995] experimentally compared LSI-based term extraction with $\chi^2$-based TSR using three different classifier learning techniques (namely, linear discriminant analysis, logistic regression, and neural networks). Their experiments showed LSI to be far more effective than $\chi^2$ for the first two techniques, while both methods performed equally well for the neural network classifier.

For other TC works that have used LSI or similar term extraction techniques, see Hull [1994], Li and Jain [1998], Schütze [1998], Weigend et al. [1999], and Yang [1995].

6. INDUCTIVE CONSTRUCTION OF TEXT CLASSIFIERS

The inductive construction of text classifiers has been tackled in a variety of ways. Here we will deal only with the methods that have been most popular in TC, but we will also briefly mention the existence of alternative, less standard approaches.

We start by discussing the general form that a text classifier has. Let us recall from Section 2.4 that there are two alternative ways of viewing classification: "hard" (fully automated) classification and ranking (semiautomated) classification.

The inductive construction of a ranking classifier for category $c_i \in C$ usually consists in the definition of a function $CSV_i : D \rightarrow [0, 1]$ that, given a document $d_j$, returns a categorization status value for it, that is, a number between 0 and 1 which, roughly speaking, represents the evidence for the fact that $d_j \in c_i$. Documents are then ranked according to their $CSV_i$ value. This works for document-ranking TC; category-ranking TC is usually tackled by ranking, for a given document $d_j$, its $CSV_i$ scores for the different categories in $C = \{c_1, \ldots, c_{|C|}\}$.

The $CSV_i$ function takes up different meanings according to the learning method used: for instance, in the Naïve Bayes approach of Section 6.2 $CSV_i(d_j)$ is defined in terms of a probability, whereas in the Rocchio approach discussed in Section 6.7 $CSV_i(d_j)$ is a measure of vector closeness in $|T|$-dimensional space.

The construction of a "hard" classifier may follow two alternative paths. The former consists in the definition of a function $CSV_i : D \rightarrow \{T, F\}$. The latter consists instead in the definition of a function $CSV_i : D \rightarrow [0, 1]$, analogous to the one used for ranking classification, followed by the definition of a threshold $\tau_i$ such that $CSV_i(d_j) \geq \tau_i$ is interpreted


as $T$ while $CSV_i(d_j) < \tau_i$ is interpreted as $F$.$^{11}$

$^{11}$ Alternative methods are possible, such as training a classifier for which some standard, predefined value such as 0 is the threshold. For ease of exposition we will not discuss them.

The definition of thresholds will be the topic of Section 6.1. In Sections 6.2 to 6.12 we will instead concentrate on the definition of $CSV_i$, discussing a number of approaches that have been applied in the TC literature. In general we will assume we are dealing with "hard" classification; it will be evident from the context how and whether the approaches can be adapted to ranking classification. The presentation of the algorithms will be mostly qualitative rather than quantitative, that is, will focus on the methods for classifier learning rather than on the effectiveness and efficiency of the classifiers built by means of them; this will instead be the focus of Section 7.

6.1. Determining Thresholds

There are various policies for determining the threshold $\tau_i$, also depending on the constraints imposed by the application. The most important distinction is whether the threshold is derived analytically or experimentally.

The former method is possible only in the presence of a theoretical result that indicates how to compute the threshold that maximizes the expected value of the effectiveness function [Lewis 1995a]. This is typical of classifiers that output probability estimates of the membership of $d_j$ in $c_i$ (see Section 6.2) and whose effectiveness is computed by decision-theoretic measures such as utility (see Section 7.1.3); we thus defer the discussion of this policy (which is called probability thresholding in Lewis [1995a]) to Section 7.1.3.

When such a theoretical result is not known, one has to revert to the latter method, which consists in testing different values for $\tau_i$ on a validation set and choosing the value which maximizes effectiveness. We call this policy CSV thresholding [Cohen and Singer 1999; Schapire et al. 1998; Wiener et al. 1995]; it is also called Scut in Yang [1999]. Different $\tau_i$'s are typically chosen for the different $c_i$'s.

A second, popular experimental policy is proportional thresholding [Iwayama and Tokunaga 1995; Larkey 1998; Lewis 1992a; Lewis and Ringuette 1994; Wiener et al. 1995], also called Pcut in Yang [1999]. This policy consists in choosing the value of $\tau_i$ for which $g_{Va}(c_i)$ is closest to $g_{Tr}(c_i)$, and embodies the principle that the same percentage of documents of both training and test set should be classified under $c_i$. For obvious reasons, this policy does not lend itself to document-pivoted TC.

Sometimes, depending on the application, a fixed thresholding policy (a.k.a. "k-per-doc" thresholding [Lewis 1992a] or Rcut [Yang 1999]) is applied, whereby it is stipulated that a fixed number $k$ of categories, equal for all $d_j$'s, are to be assigned to each document $d_j$. This is often used, for instance, in applications of TC to automated document indexing [Field 1975; Lam et al. 1999]. Strictly speaking, however, this is not a thresholding policy in the sense defined at the beginning of Section 6, as it might happen that $d'$ is classified under $c_i$, $d''$ is not, and $CSV_i(d') < CSV_i(d'')$. Quite clearly, this policy is mostly at home with document-pivoted TC. However, it suffers from a certain coarseness, as the fact that $k$ is equal for all documents (nor could this be otherwise) allows no fine-tuning.

In his experiments Lewis [1992a] found the proportional policy to be superior to probability thresholding when microaveraged effectiveness was tested but slightly inferior with macroaveraging (see Section 7.1.1). Yang [1999] found instead CSV thresholding to be superior to proportional thresholding (possibly due to her category-specific optimization on a validation set), and found fixed thresholding to be consistently inferior to the other two policies. The fact that these latter results have been obtained across different classifiers no doubt reinforces them.

In general, aside from the considerations above, the choice of the thresholding


policy may also be influenced by the application; for instance, in applying a text classifier to document indexing for Boolean systems a fixed thresholding policy might be chosen, while a proportional or CSV thresholding method might be chosen for Web page classification under hierarchical catalogues.

6.2. Probabilistic Classifiers

Probabilistic classifiers (see Lewis [1998] for a thorough discussion) view $CSV_i(d_j)$ in terms of $P(c_i \mid \vec{d}_j)$, that is, the probability that a document represented by a vector $\vec{d}_j = \langle w_{1j}, \ldots, w_{|T|j} \rangle$ of (binary or weighted) terms belongs to $c_i$, and compute this probability by an application of Bayes' theorem, given by

$$P(c_i \mid \vec{d}_j) = \frac{P(c_i)\, P(\vec{d}_j \mid c_i)}{P(\vec{d}_j)}. \tag{3}$$

In (3) the event space is the space of documents: $P(\vec{d}_j)$ is thus the probability that a randomly picked document has vector $\vec{d}_j$ as its representation, and $P(c_i)$ the probability that a randomly picked document belongs to $c_i$.

The estimation of $P(\vec{d}_j \mid c_i)$ in (3) is problematic, since the number of possible vectors $\vec{d}_j$ is too high (the same holds for $P(\vec{d}_j)$, but for reasons that will be clear shortly this will not concern us). In order to alleviate this problem it is common to make the assumption that any two coordinates of the document vector are, when viewed as random variables, statistically independent of each other; this independence assumption is encoded by the equation

$$P(\vec{d}_j \mid c_i) = \prod_{k=1}^{|T|} P(w_{kj} \mid c_i). \tag{4}$$

Probabilistic classifiers that use this assumption are called Naïve Bayes classifiers, and account for most of the probabilistic approaches to TC in the literature (see Joachims [1998]; Koller and Sahami [1997]; Larkey and Croft [1996]; Lewis [1992a]; Lewis and Gale [1994]; Li and Jain [1998]; Robertson and Harding [1984]). The "naïve" character of the classifier is due to the fact that usually this assumption is, quite obviously, not verified in practice.

One of the best-known Naïve Bayes approaches is the binary independence classifier [Robertson and Sparck Jones 1976], which results from using binary-valued vector representations for documents. In this case, if we write $p_{ki}$ as short for $P(w_{kx} = 1 \mid c_i)$, the $P(w_{kj} \mid c_i)$ factors of (4) may be written as

$$P(w_{kj} \mid c_i) = p_{ki}^{w_{kj}} (1 - p_{ki})^{1 - w_{kj}} = \left( \frac{p_{ki}}{1 - p_{ki}} \right)^{w_{kj}} (1 - p_{ki}). \tag{5}$$

We may further observe that in TC the document space is partitioned into two categories,$^{12}$ $c_i$ and its complement $\bar{c}_i$, such that $P(\bar{c}_i \mid \vec{d}_j) = 1 - P(c_i \mid \vec{d}_j)$. If we plug in (4) and (5) into (3) and take logs we obtain

$$\log P(c_i \mid \vec{d}_j) = \log P(c_i) + \sum_{k=1}^{|T|} w_{kj} \log \frac{p_{ki}}{1 - p_{ki}} + \sum_{k=1}^{|T|} \log(1 - p_{ki}) - \log P(\vec{d}_j) \tag{6}$$

$$\log(1 - P(c_i \mid \vec{d}_j)) = \log(1 - P(c_i)) + \sum_{k=1}^{|T|} w_{kj} \log \frac{p_{k\bar{i}}}{1 - p_{k\bar{i}}} + \sum_{k=1}^{|T|} \log(1 - p_{k\bar{i}}) - \log P(\vec{d}_j), \tag{7}$$

$^{12}$ Cooper [1995] has pointed out that in this case the full independence assumption of (4) is not actually made in the Naïve Bayes classifier; the assumption needed here is instead the weaker linked dependence assumption, which may be written as

$$\frac{P(\vec{d}_j \mid c_i)}{P(\vec{d}_j \mid \bar{c}_i)} = \prod_{k=1}^{|T|} \frac{P(w_{kj} \mid c_i)}{P(w_{kj} \mid \bar{c}_i)}.$$


where we write $p_{k\bar{i}}$ as short for $P(w_{kx} = 1 \mid \bar{c}_i)$. We may convert (6) and (7) into a single equation by subtracting componentwise (7) from (6), thus obtaining

$$\log \frac{P(c_i \mid \vec{d}_j)}{1 - P(c_i \mid \vec{d}_j)} = \log \frac{P(c_i)}{1 - P(c_i)} + \sum_{k=1}^{|T|} w_{kj} \log \frac{p_{ki}(1 - p_{k\bar{i}})}{p_{k\bar{i}}(1 - p_{ki})} + \sum_{k=1}^{|T|} \log \frac{1 - p_{ki}}{1 - p_{k\bar{i}}}. \tag{8}$$

Note that $\frac{P(c_i \mid \vec{d}_j)}{1 - P(c_i \mid \vec{d}_j)}$ is an increasing monotonic function of $P(c_i \mid \vec{d}_j)$, and may thus be used directly as $CSV_i(d_j)$. Note also that $\log \frac{P(c_i)}{1 - P(c_i)}$ and $\sum_{k=1}^{|T|} \log \frac{1 - p_{ki}}{1 - p_{k\bar{i}}}$ are constant for all documents, and may thus be disregarded.$^{13}$ Defining a classifier for category $c_i$ thus basically requires estimating the $2|T|$ parameters $\{p_{1i}, p_{1\bar{i}}, \ldots, p_{|T|i}, p_{|T|\bar{i}}\}$ from the training data, which may be done in the obvious way. Note that in general the classification of a given document does not require one to compute a sum of $|T|$ factors, as the presence of $\sum_{k=1}^{|T|} w_{kj} \log \frac{p_{ki}(1 - p_{k\bar{i}})}{p_{k\bar{i}}(1 - p_{ki})}$ would imply; in fact, all the factors for which $w_{kj} = 0$ may be disregarded, and this accounts for the vast majority of them, since document vectors are usually very sparse.

$^{13}$ This is not true, however, if the fixed thresholding method of Section 6.1 is adopted. In fact, for a fixed document $d_j$ the first and third factor in the formula above are different for different categories, and may therefore influence the choice of the categories under which to file $d_j$.

The method we have illustrated is just one of the many variants of the Naïve Bayes approach, the common denominator of which is (4). A recent paper by Lewis [1998] is an excellent roadmap on the various directions that research on Naïve Bayes classifiers has taken; among these are the ones aiming

- to relax the constraint that document vectors should be binary-valued. This looks natural, given that weighted indexing techniques (see Fuhr [1989]; Salton and Buckley [1988]) accounting for the importance of $t_k$ for $d_j$ play a key role in IR.
- to introduce document length normalization. The value of $\log \frac{P(c_i \mid \vec{d}_j)}{1 - P(c_i \mid \vec{d}_j)}$ tends to be more extreme (i.e., very high or very low) for long documents (i.e., documents such that $w_{kj} = 1$ for many values of $k$), irrespective of their semantic relatedness to $c_i$, thus calling for length normalization. Taking length into account is easy in nonprobabilistic approaches to classification (see Section 6.7), but is problematic in probabilistic ones (see Lewis [1998], Section 5). One possible answer is to switch from an interpretation of Naïve Bayes in which documents are events to one in which terms are events [Baker and McCallum 1998; McCallum et al. 1998; Chakrabarti et al. 1998a; Guthrie et al. 1994]. This accounts for document length naturally but, as noted by Lewis [1998], has the drawback that different occurrences of the same word within the same document are viewed as independent, an assumption even more implausible than (4).
- to relax the independence assumption. This may be the hardest route to follow, since this produces classifiers of higher computational cost and characterized by harder parameter estimation problems [Koller and Sahami 1997]. Earlier efforts in this direction within probabilistic text search (e.g., van Rijsbergen [1977]) have not shown the performance improvements that were hoped for. Recently, the fact that the binary independence assumption seldom harms effectiveness has also been given some theoretical justification [Domingos and Pazzani 1997].

The mention of "text search" in the last paragraph is not accidental. Unlike other types of classifiers, the literature on probabilistic classifiers is inextricably intertwined with that on probabilistic search systems (see Crestani et al. [1998] for a


Fig. 2. A decision tree equivalent to the DNF rule of Figure 1. Edges are labeled
by terms and leaves are labeled by categories (underlining denotes negation).

review), since these latter attempt to determine the probability that a document falls in the category denoted by the query, and since they are the only search systems that take relevance feedback, a notion essentially involving supervised learning, as central.

the weights that the terms labeling the internal nodes have in vector $\vec{d}_j$, until a leaf node is reached; the label of this node is then assigned to $d_j$. Most such classifiers use binary document representations, and thus consist of binary trees. An example DT is illustrated in Figure 2.
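Returning to the binary independence classifier of Section 6.2, its scoring rule can be sketched in a few lines of Python. The sketch below estimates, for each term, the probability of occurrence in documents of the category and in documents of its complement, and scores a document by summing the resulting log-odds weights over the terms it contains; the two addends of the log-odds formula that are constant across documents are dropped. The add-one smoothing used to keep the estimates away from 0 and 1, the set-of-terms document representation, and all function names are assumptions of this sketch rather than choices prescribed by the text.

```python
import math


def train_binary_independence(docs, is_positive, vocabulary):
    """Return one log-odds weight per term for a single category c_i.

    docs: list of sets of terms (binary representation).
    is_positive: list of booleans, True when the document belongs to c_i.
    """
    n_pos = sum(is_positive)
    n_neg = len(docs) - n_pos
    weights = {}
    for term in vocabulary:
        df_pos = sum(1 for d, y in zip(docs, is_positive) if y and term in d)
        df_neg = sum(1 for d, y in zip(docs, is_positive) if not y and term in d)
        # Add-one smoothing (an assumption of this sketch) avoids log(0).
        p_pos = (df_pos + 1) / (n_pos + 2)   # estimate of P(w_k = 1 | c_i)
        p_neg = (df_neg + 1) / (n_neg + 2)   # estimate of P(w_k = 1 | not c_i)
        weights[term] = math.log(p_pos * (1 - p_neg) / (p_neg * (1 - p_pos)))
    return weights


def csv_score(weights, doc):
    """Sum the weights of the terms present in doc; absent terms contribute 0."""
    return sum(weights[t] for t in doc if t in weights)
```

Because only the terms that actually occur in a document are visited, the cost of scoring grows with document length rather than with vocabulary size, which reflects the sparsity argument made in Section 6.2.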
6.3. Decision Tree Classifiers

Probabilistic methods are quantitative (i.e., numeric) in nature, and as such have sometimes been criticized since, effective as they may be, they are not easily interpretable by humans. A class of algorithms that do not suffer from this problem are symbolic (i.e., nonnumeric) algorithms, among which inductive rule learners (which we will discuss in Section 6.4) and decision tree learners are the most important examples.

A decision tree (DT) text classifier (see Mitchell [1996], Chapter 3) is a tree in which internal nodes are labeled by terms, branches departing from them are labeled by tests on the weight that the term has in the test document, and leaves are labeled by categories. Such a classifier categorizes a test document $d_j$ by recursively testing for

There are a number of standard packages for DT learning, and most DT approaches to TC have made use of one such package. Among the most popular ones are ID3 (used by Fuhr et al. [1991]), C4.5 (used by Cohen and Hirsh [1998], Cohen and Singer [1999], Joachims [1998], and Lewis and Catlett [1994]), and C5 (used by Li and Jain [1998]). TC efforts based on experimental DT packages include Dumais et al. [1998], Lewis and Ringuette [1994], and Weiss et al. [1999].

A possible method for learning a DT for category $c_i$ consists in a "divide and conquer" strategy of (i) checking whether all the training examples have the same label (either $c_i$ or $\bar{c}_i$); (ii) if not, selecting a term $t_k$, partitioning $Tr$ into classes of documents that have the same value for $t_k$, and placing each such class in a separate subtree. The process is recursively repeated on the subtrees until each


leaf of the tree so generated contains training examples assigned to the same category $c_i$, which is then chosen as the label for the leaf. The key step is the choice of the term $t_k$ on which to operate the partition, a choice which is generally made according to an information gain or entropy criterion. However, such a "fully grown" tree may be prone to overfitting, as some branches may be too specific to the training data. Most DT learning methods thus include a method for growing the tree and one for pruning it, that is, for removing the overly specific branches. Variations on this basic schema for DT learning abound [Mitchell 1996, Section 3].

DT text classifiers have been used either as the main classification tool [Fuhr et al. 1991; Lewis and Catlett 1994; Lewis and Ringuette 1994], or as baseline classifiers [Cohen and Singer 1999; Joachims 1998], or as members of classifier committees [Li and Jain 1998; Schapire and Singer 2000; Weiss et al. 1999].

according to some minimality criterion. While DTs are typically built by a top-down, "divide-and-conquer" strategy, DNF rules are often built in a bottom-up fashion. Initially, every training example $d_j$ is viewed as a clause $\eta_1, \ldots, \eta_n \rightarrow \gamma_i$, where $\eta_1, \ldots, \eta_n$ are the terms contained in $d_j$ and $\gamma_i$ equals $c_i$ or $\bar{c}_i$ according to whether $d_j$ is a positive or negative example of $c_i$. This set of clauses is already a DNF classifier for $c_i$, but obviously scores high in terms of overfitting. The learner then applies a process of generalization in which the rule is simplified through a series of modifications (e.g., removing premises from clauses, or merging clauses) that maximize its compactness while at the same time not affecting the "covering" property of the classifier. At the end of this process, a "pruning" phase similar in spirit to that employed in DTs is applied, where the ability to correctly classify all the training examples is traded for more generality.
6.4. Decision Rule Classifiers

A classifier for category $c_i$ built by an inductive rule learning method consists of a DNF rule, that is, of a conditional rule with a premise in disjunctive normal form (DNF), of the type illustrated in Figure 1.$^{14}$ The literals (i.e., possibly negated keywords) in the premise denote the presence (nonnegated keyword) or absence (negated keyword) of the keyword in the test document $d_j$, while the clause head denotes the decision to classify $d_j$ under $c_i$. DNF rules are similar to DTs in that they can encode any Boolean function. However, an advantage of DNF rule learners is that they tend to generate more compact classifiers than DT learners.

$^{14}$ Many inductive rule learning algorithms build decision lists (i.e., arbitrarily nested if-then-else clauses) instead of DNF rules; since the former may always be rewritten as the latter, we will disregard the issue.

Rule learning methods usually attempt to select from all the possible covering rules (i.e., rules that correctly classify all the training examples) the best one

DNF rule learners vary widely in terms of the methods, heuristics and criteria employed for generalization and pruning. Among the DNF rule learners that have been applied to TC are CHARADE [Moulinier and Ganascia 1996], DL-ESC [Li and Yamanishi 1999], RIPPER [Cohen 1995a; Cohen and Hirsh 1998; Cohen and Singer 1999], SCAR [Moulinier et al. 1996], and SWAP-1 [Apté 1994].

While the methods above use rules of propositional logic (PL), research has also been carried out using rules of first-order logic (FOL), obtainable through the use of inductive logic programming methods. Cohen [1995a] has extensively compared PL and FOL learning in TC (for instance, comparing the PL learner RIPPER with its FOL version FLIPPER), and has found that the additional representational power of FOL brings about only modest benefits.

6.5. Regression Methods

Various TC efforts have used regression models (see Fuhr and Pfeifer [1994]; Ittner et al. [1995]; Lewis and Gale [1994]; Schütze et al. [1995]). Regression denotes


the approximation of a real-valued (rather than binary, as in the case of classification) function $\breve{\Phi}$ by means of a function $\Phi$ that fits the training data [Mitchell 1996, page 236]. Here we will describe one such model, the Linear Least-Squares Fit (LLSF) applied to TC by Yang and Chute [1994]. In LLSF, each document $d_j$ has two vectors associated with it: an input vector $I(d_j)$ of $|T|$ weighted terms, and an output vector $O(d_j)$ of $|C|$ weights representing the categories (the weights for this latter vector are binary for training documents, and are nonbinary CSVs for test documents). Classification may thus be seen as the task of determining an output vector $O(d_j)$ for test document $d_j$, given its input vector $I(d_j)$; hence, building a classifier boils down to computing a $|C| \times |T|$ matrix $\hat{M}$ such that $\hat{M} I(d_j) = O(d_j)$.

LLSF computes the matrix from the training data by computing a linear least-squares fit that minimizes the error on the training set according to the formula $\hat{M} = \arg\min_M \|MI - O\|_F$, where $\arg\min_M(x)$ stands as usual for the $M$ for which $x$ is minimum, $\|V\|_F \stackrel{def}{=} \sqrt{\sum_{i=1}^{|C|} \sum_{j=1}^{|T|} v_{ij}^2}$ represents the so-called Frobenius norm of a $|C| \times |T|$ matrix, $I$ is the $|T| \times |Tr|$ matrix whose columns are the input vectors of the training documents, and $O$ is the $|C| \times |Tr|$ matrix whose columns are the output vectors of the training documents. The $\hat{M}$ matrix is usually computed by performing a singular value decomposition on the training set, and its generic entry $\hat{m}_{ik}$ represents the degree of association between category $c_i$ and term $t_k$.

The experiments of Yang and Chute [1994] and Yang and Liu [1999] indicate that LLSF is one of the most effective text classifiers known to date. One of its disadvantages, though, is that the cost of computing the $\hat{M}$ matrix is much higher than that of many other competitors in the TC arena.

6.6. On-Line Methods

A linear classifier for category $c_i$ is a vector $\vec{c}_i = \langle w_{1i}, \ldots, w_{|T|i} \rangle$ belonging to the same $|T|$-dimensional space in which documents are also represented, and such that $CSV_i(d_j)$ corresponds to the dot product $\sum_{k=1}^{|T|} w_{ki} w_{kj}$ of $\vec{d}_j$ and $\vec{c}_i$. Note that when both classifier and document weights are cosine-normalized (see (2)), the dot product between the two vectors corresponds to their cosine similarity, that is:

$$S(c_i, d_j) = \cos(\alpha) = \frac{\sum_{k=1}^{|T|} w_{ki} \cdot w_{kj}}{\sqrt{\sum_{k=1}^{|T|} w_{ki}^2} \cdot \sqrt{\sum_{k=1}^{|T|} w_{kj}^2}},$$

which represents the cosine of the angle $\alpha$ that separates the two vectors. This is the similarity measure between query and document computed by standard vector-space IR engines, which means in turn that once a linear classifier has been built, classification can be performed by invoking such an engine. Practically all search engines have a dot product flavor to them, and can therefore be adapted to doing TC with a linear classifier.

Methods for learning linear classifiers are often partitioned into two broad classes, batch methods and on-line methods.

Batch methods build a classifier by analyzing the training set all at once. Within the TC literature, one example of a batch method is linear discriminant analysis, a model of the stochastic dependence between terms that relies on the covariance matrices of the categories [Hull 1994; Schütze et al. 1995]. However, the foremost example of a batch method is the Rocchio method; because of its importance in the TC literature, this will be discussed separately in Section 6.7. In this section we will instead concentrate on on-line methods.

On-line (a.k.a. incremental) methods build a classifier soon after examining the first training document, and incrementally refine it as they examine new ones. This may be an advantage in the applications in which $Tr$ is not available in its entirety from the start, or in which the "meaning" of the category may change in time, as for example, in adaptive

filtering. This is also suited to applications (e.g., semiautomated classification, adaptive filtering) in which we may expect the user of a classifier to provide feedback on how test documents have been classified, as in this case further training may be performed during the operating phase by exploiting user feedback.

A simple on-line method is the perceptron algorithm, first applied to TC by Schütze et al. [1995] and Wiener et al. [1995], and subsequently used by Dagan et al. [1997] and Ng et al. [1997]. In this algorithm, the classifier for $c_i$ is first initialized by setting all weights $w_{ki}$ to the same positive value. When a training example $d_j$ (represented by a vector $\vec{d}_j$ of binary weights) is examined, the classifier built so far classifies it. If the result of the classification is correct, nothing is done, while if it is wrong, the weights of the classifier are modified: if $d_j$ was a positive example of $c_i$, then the weights $w_{ki}$ of "active terms" (i.e., the terms $t_k$ such that $w_{kj} = 1$) are "promoted" by increasing them by a fixed quantity $\alpha > 0$ (called learning rate), while if $d_j$ was a negative example of $c_i$ then the same weights are "demoted" by decreasing them by $\alpha$. Note that when the classifier has reached a reasonable level of effectiveness, the fact that a weight $w_{ki}$ is very low means that $t_k$ has negatively contributed to the classification process so far, and may thus be discarded from the representation. We may then see the perceptron algorithm (as all other incremental learning methods) as allowing for a sort of "on-the-fly term space reduction" [Dagan et al. 1997, Section 4.4]. The perceptron classifier has shown a good effectiveness in all the experiments quoted above.

The perceptron is an additive weight-updating algorithm. A multiplicative variant of it is POSITIVE WINNOW [Dagan

$w_{ki}^{+}$ and $w_{ki}^{-}$ for each term $t_k$; the final weight $w_{ki}$ used in computing the dot product is the difference $w_{ki}^{+} - w_{ki}^{-}$. Following the misclassification of a positive instance, active terms have their $w_{ki}^{+}$ weight promoted and their $w_{ki}^{-}$ weight demoted, whereas in the case of a negative instance it is $w_{ki}^{+}$ that gets demoted while $w_{ki}^{-}$ gets promoted (for the rest, promotions and demotions are as in POSITIVE WINNOW). BALANCED WINNOW allows negative $w_{ki}$ weights, while in the perceptron and in POSITIVE WINNOW the $w_{ki}$ weights are always positive. In experiments conducted by Dagan et al. [1997], POSITIVE WINNOW showed a better effectiveness than perceptron but was in turn outperformed by (Dagan et al.'s own version of) BALANCED WINNOW.

Other on-line methods for building text classifiers are WIDROW-HOFF, a refinement of it called EXPONENTIATED GRADIENT (both applied for the first time to TC in [Lewis et al. 1996]) and SLEEPING EXPERTS [Cohen and Singer 1999], a version of BALANCED WINNOW. While the first is an additive weight-updating algorithm, the second and third are multiplicative. Key differences with the previously described algorithms are that these three algorithms (i) update the classifier not only after misclassifying a training example, but also after classifying it correctly, and (ii) update the weights corresponding to all terms (instead of just active ones).

Linear classifiers lend themselves to both category-pivoted and document-pivoted TC. For the former the classifier $\vec{c}_i$ is used, in a standard search engine, as a query against the set of test documents, while for the latter the vector $\vec{d}_j$ representing the test document is used as a query against the set of classifiers $\{\vec{c}_1, \ldots, \vec{c}_{|C|}\}$.
et al. 1997], which differs from perceptron
because two different constants 1 > 1 and 6.7. The Rocchio Method
0 < 2 < 1 are used for promoting and de-
moting weights, respectively, and because Some linear classifiers consist of an ex-
promotion and demotion are achieved by plicit profile (or prototypical document)
multiplying, instead of adding, by 1 and of the category. This has obvious advan-
2 . BALANCED WINNOW [Dagan et al. 1997] tages in terms of interpretability, as such
is a further variant of POSITIVE WINNOW, in a profile is more readily understandable
which the classifier consists of two weights by a human than, say, a neural network

ACM Computing Surveys, Vol. 34, No. 1, March 2002.


26 Sebastiani

classifier. Learning a linear classifier is often preceded by local TSR; in this case, a
profile of $c_i$ is a weighted list of the terms whose presence or absence is most useful
for discriminating $c_i$.

The Rocchio method is used for inducing linear, profile-style classifiers. It relies on an
adaptation to TC of the well-known Rocchio's formula for relevance feedback in the
vector-space model, and it is perhaps the only TC method rooted in the IR tradition rather
than in the ML one. This adaptation was first proposed by Hull [1994], and has been used by
many authors since then, either as an object of research in its own right [Ittner et al.
1995; Joachims 1997; Sable and Hatzivassiloglou 2000; Schapire et al. 1998; Singhal et al.
1997], or as a baseline classifier [Cohen and Singer 1999; Galavotti et al. 2000; Joachims
1998; Lewis et al. 1996; Schapire and Singer 2000; Schütze et al. 1995], or as a member of
a classifier committee [Larkey and Croft 1996] (see Section 6.11).

Rocchio's method computes a classifier $\vec{c}_i = \langle w_{1i}, \ldots, w_{|T|i} \rangle$
for category $c_i$ by means of the formula

\[
w_{ki} = \beta \cdot \sum_{\{d_j \in POS_i\}} \frac{w_{kj}}{|POS_i|}
       - \gamma \cdot \sum_{\{d_j \in NEG_i\}} \frac{w_{kj}}{|NEG_i|},
\]

where $w_{kj}$ is the weight of $t_k$ in document $d_j$,
$POS_i = \{d_j \in Tr \mid \breve{\Phi}(d_j, c_i) = T\}$, and
$NEG_i = \{d_j \in Tr \mid \breve{\Phi}(d_j, c_i) = F\}$. In this formula, $\beta$ and
$\gamma$ are control parameters that allow setting the relative importance of positive and
negative examples. For instance, if $\beta$ is set to 1 and $\gamma$ to 0 (as in Dumais
et al. [1998]; Hull [1994]; Joachims [1998]; Schütze et al. [1995]), the profile of $c_i$ is
the centroid of its positive training examples. A classifier built by means of the Rocchio
method rewards the closeness of a test document to the centroid of the positive training
examples, and its distance from the centroid of the negative training examples. The role of
negative examples is usually deemphasized, by setting $\beta$ to a high value and $\gamma$
to a low one (e.g., Cohen and Singer [1999], Ittner et al. [1995], and Joachims [1997] use
$\beta = 16$ and $\gamma = 4$).

This method is quite easy to implement, and is also quite efficient, since learning a
classifier basically comes down to averaging weights. In terms of effectiveness, instead, a
drawback is that if the documents in the category tend to occur in disjoint clusters (e.g.,
a set of newspaper articles labeled with the Sports category and dealing with either boxing
or rock-climbing), such a classifier may miss most of them, as the centroid of these
documents may fall outside all of these clusters (see Figure 3(a)). More generally, a
classifier built by the Rocchio method, as all linear classifiers, has the disadvantage that
it divides the space of documents linearly. This situation is graphically depicted in
Figure 3(a), where documents are classified within $c_i$ if and only if they fall within the
circle. Note that even most of the positive training examples would not be classified
correctly by the classifier.

6.7.1. Enhancements to the Basic Rocchio Framework. One issue in the application of the
Rocchio formula to profile extraction is whether the set $NEG_i$ should be considered in its
entirety, or whether a well-chosen sample of it, such as the set $NPOS_i$ of near-positives
(defined as the most positive among the negative training examples), should be selected
from it, yielding

\[
w_{ki} = \beta \cdot \sum_{\{d_j \in POS_i\}} \frac{w_{kj}}{|POS_i|}
       - \gamma \cdot \sum_{\{d_j \in NPOS_i\}} \frac{w_{kj}}{|NPOS_i|}.
\]

The $\sum_{\{d_j \in NPOS_i\}} \frac{w_{kj}}{|NPOS_i|}$ factor is more significant than
$\sum_{\{d_j \in NEG_i\}} \frac{w_{kj}}{|NEG_i|}$, since near-positives are the most
difficult documents to tell apart from the positives. Using near-positives corresponds to
the query zoning method proposed for IR by Singhal et al. [1997]. This method originates
from the observation that, when the original

Fig. 3. A comparison between the TC behavior of (a) the Rocchio classifier, and
(b) the k-NN classifier. Small crosses and circles denote positive and negative
training instances, respectively. The big circles denote the influence area of
the classifier. Note that, for ease of illustration, document similarities are here
viewed in terms of Euclidean distance rather than, as is more common, in terms
of dot product or cosine.

Rocchio formula is used for relevance feedback in IR, near-positives tend to be used rather
than generic negatives, as the documents on which user judgments are available are among the
ones that had scored highest in the previous ranking. Early applications of the Rocchio
formula to TC (e.g., Hull [1994]; Ittner et al. [1995]) generally did not make a distinction
between near-positives and generic negatives. In order to select the near-positives Schapire
et al. [1998] issue a query, consisting of the centroid of the positive training examples,
against a document base consisting of the negative training examples; the top-ranked ones
are the most similar to this centroid, and are then the near-positives. Wiener et al. [1995]
instead equate the near-positives of $c_i$ to the positive examples of the sibling
categories of $c_i$, as in the application they work on (TC with hierarchically organized
category sets) the notion of a sibling category of $c_i$ is well defined. A similar policy
is also adopted by Ng et al. [1997], Ruiz and Srinivasan [1999], and Weigend et al. [1999].

By using query zoning plus other enhancements (TSR, statistical phrases, and a method called
dynamic feedback optimization), Schapire et al. [1998] have found that a Rocchio classifier
can achieve an effectiveness comparable to that of a state-of-the-art ML method such as
boosting (see Section 6.11.1) while being 60 times quicker to train. These recent results
will no doubt bring about a renewed interest for the Rocchio classifier, previously
considered an underperformer [Cohen and Singer 1999; Joachims 1998; Lewis et al. 1996;
Schütze et al. 1995; Yang 1999].

6.8. Neural Networks

A neural network (NN) text classifier is a network of units, where the input units represent
terms, the output unit(s) represent the category or categories of interest, and the weights
on the edges connecting units represent dependence relations. For classifying a test
document $d_j$, its term weights $w_{kj}$ are loaded into the input units; the activation of
these units is propagated forward through the network, and the value of the output unit(s)
determines the categorization decision(s). A typical way of training NNs is backpropagation,
whereby the term weights of a training document are loaded into the input units, and if a
misclassification occurs the error is backpropagated so as to change the parameters of the
network and eliminate or minimize the error.
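
The forward propagation and error-driven weight change just described can be sketched for
the simplest possible network, in which the input units are connected directly to a single
sigmoid output unit. This is only an illustrative sketch under stated assumptions, not a
method taken from the works cited: with no hidden layer, backpropagating the error reduces
to the delta rule (the gradient of the cross-entropy loss), and the names `forward`,
`train`, and `classify`, the learning rate, the number of epochs, and the toy term space are
all hypothetical choices.

```python
import math

def forward(weights, doc):
    """Load the term weights of doc into the input units and propagate them
    to the single sigmoid output unit; returns an activation in (0, 1)."""
    z = sum(w * x for w, x in zip(weights, doc))
    return 1.0 / (1.0 + math.exp(-z))

def train(docs, labels, epochs=200, rate=0.5):
    """Error-driven training: after each training document, move every edge
    weight in the direction that reduces the output error."""
    weights = [0.0] * len(docs[0])
    for _ in range(epochs):
        for doc, label in zip(docs, labels):
            error = label - forward(weights, doc)  # signed output error
            for k, x in enumerate(doc):
                weights[k] += rate * error * x     # inputs with x == 0 are unchanged
    return weights

def classify(weights, doc, threshold=0.5):
    """Binary categorization decision obtained by thresholding the output unit."""
    return forward(weights, doc) >= threshold

# Hypothetical toy term space: (wheat, grain, bank, loan), binary term weights.
DOCS = [[1, 1, 0, 0], [1, 0, 0, 0], [0, 1, 0, 0],
        [0, 0, 1, 1], [0, 0, 1, 0], [0, 0, 0, 1]]
LABELS = [1, 1, 1, 0, 0, 0]  # 1 = positive example of the category, 0 = negative
WEIGHTS = train(DOCS, LABELS)
```

In a network with hidden layers the same output error would be propagated backward through
the intermediate units; the single-unit case is shown here only because it keeps the update
rule visible in one line.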

The simplest type of NN classifier is the perceptron [Dagan et al. 1997; Ng et al. 1997],
which is a linear classifier and as such has been extensively discussed in Section 6.6.
Other types of linear NN classifiers implementing a form of logistic regression have also
been proposed and tested by Schütze et al. [1995] and Wiener et al. [1995], yielding very
good effectiveness.

A nonlinear NN [Lam and Lee 1999; Ruiz and Srinivasan 1999; Schütze et al. 1995; Weigend
et al. 1999; Wiener et al. 1995; Yang and Liu 1999] is instead a network with one or more
additional layers of units, which in TC usually represent higher-order interactions between
terms that the network is able to learn. When comparative experiments relating nonlinear NNs
to their linear counterparts have been performed, the former have yielded either no
improvement [Schütze et al. 1995] or very small improvements [Wiener et al. 1995] over the
latter.

6.9. Example-Based Classifiers

Example-based classifiers do not build an explicit, declarative representation of the
category $c_i$, but rely on the category labels attached to the training documents similar
to the test document. These methods have thus been called lazy learners, since "they defer
the decision on how to generalize beyond the training data until each new query instance is
encountered" [Mitchell 1996, page 244].

The first application of example-based methods (a.k.a. memory-based reasoning methods) to TC
is due to Creecy, Masand and colleagues [Creecy et al. 1992; Masand et al. 1992]; other
examples include Joachims [1998], Lam et al. [1999], Larkey [1998], Larkey [1999], Li and
Jain [1998], Yang and Pedersen [1997], and Yang and Liu [1999]. Our presentation of the
example-based approach will be based on the k-NN (for k nearest neighbors) algorithm used by
Yang [1994]. For deciding whether $d_j \in c_i$, k-NN looks at whether the $k$ training
documents most similar to $d_j$ also are in $c_i$; if the answer is positive for a large
enough proportion of them, a positive decision is taken, and a negative decision is taken
otherwise. Actually, Yang's is a distance-weighted version of k-NN (see [Mitchell 1996,
Section 8.2.1]), since the fact that a most similar document is in $c_i$ is weighted by its
similarity with the test document. Classifying $d_j$ by means of k-NN thus comes down to
computing

\[
CSV_i(d_j) = \sum_{d_z \in Tr_k(d_j)} RSV(d_j, d_z) \cdot [[\breve{\Phi}(d_z, c_i)]],
\qquad (9)
\]

where $Tr_k(d_j)$ is the set of the $k$ documents $d_z$ which maximize $RSV(d_j, d_z)$ and

\[
[[\alpha]] = \begin{cases} 1 & \text{if } \alpha = T \\ 0 & \text{if } \alpha = F \end{cases}.
\]

The thresholding methods of Section 6.1 can then be used to convert the real-valued
$CSV_i$'s into binary categorization decisions. In (9), $RSV(d_j, d_z)$ represents some
measure of semantic relatedness between a test document $d_j$ and a training document
$d_z$; any matching function, be it probabilistic (as used by Larkey and Croft [1996]) or
vector-based (as used by Yang [1994]), from a ranked IR system may be used for this purpose.
The construction of a k-NN classifier also involves determining (experimentally, on a
validation set) a threshold $k$ that indicates how many top-ranked training documents have
to be considered for computing $CSV_i(d_j)$. Larkey and Croft [1996] used $k = 20$, while
Yang [1994, 1999] has found $30 \le k \le 45$ to yield the best effectiveness. Anyhow,
various experiments have shown that increasing the value of $k$ does not significantly
degrade the performance.

Note that k-NN, unlike linear classifiers, does not divide the document space linearly, and
hence does not suffer from the problem discussed at the end of Section 6.7. This is
graphically depicted in Figure 3(b), where the more local character of k-NN with respect to
Rocchio can be appreciated.

This method is naturally geared toward document-pivoted TC, since ranking the training
documents for their similarity with the test document can be done once for all categories.
For category-pivoted TC, one would need to store the document ranks for each test document,
which is obviously clumsy; DPC is thus de facto the only reasonable way to use k-NN.

A number of different experiments (see Section 7.3) have shown k-NN to be quite effective.
However, its most important drawback is its inefficiency at classification time: while, for
example, with a linear classifier only a dot product needs to be computed to classify a test
document, k-NN requires the entire training set to be ranked for similarity with the test
document, which is much more expensive. This is a drawback of lazy learning methods, since
they do not have a true training phase and thus defer all the computation to classification
time.

6.9.1. Other Example-Based Techniques. Various example-based techniques have been used in
the TC literature. For example, Cohen and Hirsh [1998] implemented an example-based
classifier by extending standard relational DBMS technology with similarity-based soft
joins. In their WHIRL system they used the scoring function

\[
CSV_i(d_j) = 1 - \prod_{d_z \in Tr_k(d_j)} (1 - RSV(d_j, d_z))^{[[\breve{\Phi}(d_z, c_i)]]}
\]

as an alternative to (9), obtaining a small but statistically significant improvement over a
version of WHIRL using (9). In their experiments this technique outperformed a number of
other classifiers, such as a C4.5 decision tree classifier and the RIPPER CNF rule-based
classifier.

A variant of the basic k-NN approach was proposed by Galavotti et al. [2000], who
reinterpreted (9) by redefining $[[\alpha]]$ as

\[
[[\alpha]] = \begin{cases} 1 & \text{if } \alpha = T \\ -1 & \text{if } \alpha = F \end{cases}.
\]

The difference from the original k-NN approach is that if a training document $d_z$ similar
to the test document $d_j$ does not belong to $c_i$, this information is not discarded but
weights negatively in the decision to classify $d_j$ under $c_i$.

A combination of profile- and example-based methods was presented in Lam and Ho [1998]. In
this work a k-NN system was fed generalized instances (GIs) in place of training documents.
This approach may be seen as the result of

- clustering the training set, thus obtaining a set of clusters
  $K_i = \{k_{i1}, \ldots, k_{i|K_i|}\}$;
- building a profile $G(k_{iz})$ (generalized instance) from the documents belonging to
  cluster $k_{iz}$ by means of some algorithm for learning linear classifiers (e.g.,
  Rocchio, WIDROW-HOFF);
- applying k-NN with profiles in place of training documents, that is, computing

\[
\begin{aligned}
CSV_i(d_j) &\stackrel{\mathrm{def}}{=} \sum_{k_{iz} \in K_i} RSV(d_j, G(k_{iz})) \cdot
  \frac{|\{d_j \in k_{iz} \mid \breve{\Phi}(d_j, c_i) = T\}|}{|\{d_j \in k_{iz}\}|} \cdot
  \frac{|\{d_j \in k_{iz}\}|}{|Tr|} \\
&= \sum_{k_{iz} \in K_i} RSV(d_j, G(k_{iz})) \cdot
  \frac{|\{d_j \in k_{iz} \mid \breve{\Phi}(d_j, c_i) = T\}|}{|Tr|}, \qquad (10)
\end{aligned}
\]

where $\frac{|\{d_j \in k_{iz} \mid \breve{\Phi}(d_j, c_i) = T\}|}{|\{d_j \in k_{iz}\}|}$
represents the degree to which $G(k_{iz})$ is a positive instance of $c_i$, and
$\frac{|\{d_j \in k_{iz}\}|}{|Tr|}$ represents its weight within the entire process.

This exploits the superior effectiveness (see Figure 3) of k-NN over linear classifiers
while at the same time avoiding the sensitivity of k-NN to the presence of outliers (i.e.,
positive instances of $c_i$ that lie out of the region where most other positive instances
of $c_i$ are located) in the training set.
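
The scoring function (10) can be illustrated with a short sketch. Several things here are
assumptions rather than details given in the text: the clusters are taken as already
computed (no clustering algorithm is shown), each profile $G(k_{iz})$ is built as the plain
centroid of its cluster (i.e., Rocchio with $\beta = 1$ and $\gamma = 0$), RSV is cosine
similarity, $|Tr|$ is taken to be the total number of clustered documents, and the names
`centroid`, `gi_csv`, and `CLUSTERS` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity, used here as one possible RSV matching function."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def centroid(vectors):
    """Profile of a cluster: the component-wise average of its document vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def gi_csv(test_doc, clusters, category):
    """CSV of test_doc for category as in Equation (10): each cluster profile
    contributes its RSV with the test document, weighted by the number of
    positive examples it contains divided by the size of the training set."""
    training_size = sum(len(cluster) for cluster in clusters)  # assumed to equal |Tr|
    score = 0.0
    for cluster in clusters:
        profile = centroid([vec for vec, _ in cluster])
        positives = sum(1 for _, cats in cluster if category in cats)
        score += cosine(test_doc, profile) * positives / training_size
    return score

# Hypothetical clustering of four training documents over (wheat, grain, bank, loan).
CLUSTERS = [
    [([1, 1, 0, 0], {"grain"}), ([1, 0, 0, 0], {"grain"})],
    [([0, 0, 1, 1], {"finance"}), ([0, 1, 1, 0], {"grain", "finance"})],
]
```

Because only one similarity per cluster is computed at classification time, the cost grows
with the number of clusters rather than with the number of training documents.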

Fig. 4. Learning support vector classifiers. The small crosses and circles represent
positive and negative training examples, respectively, whereas lines represent decision
surfaces. Decision surface $\sigma_i$ (indicated by the thicker line) is, among those shown,
the best possible one, as it is the middle element of the widest set of parallel decision
surfaces (i.e., its minimum distance to any training example is maximum). Small boxes
indicate the support vectors.

6.10. Building Classifiers by Support Vector Machines

The support vector machine (SVM) method has been introduced in TC by Joachims [1998, 1999]
and subsequently used by Drucker et al. [1999], Dumais et al. [1998], Dumais and Chen
[2000], Klinkenberg and Joachims [2000], Taira and Haruno [1999], and Yang and Liu [1999].
In geometrical terms, it may be seen as the attempt to find, among all the surfaces
$\sigma_1, \sigma_2, \ldots$ in $|T|$-dimensional space that separate the positive from the
negative training examples (decision surfaces), the $\sigma_i$ that separates the positives
from the negatives by the widest possible margin, that is, such that the separation property
is invariant with respect to the widest possible translation of $\sigma_i$.

This idea is best understood in the case in which the positives and the negatives are
linearly separable, in which case the decision surfaces are $(|T|-1)$-hyperplanes. In the
two-dimensional case of Figure 4, various lines may be chosen as decision surfaces. The SVM
method chooses the middle element from the widest set of parallel lines, that is, from the
set in which the maximum distance between two elements in the set is highest. It is
noteworthy that this best decision surface is determined by only a small set of training
examples, called the support vectors.

The method described is applicable also to the case in which the positives and the negatives
are not linearly separable. Yang and Liu [1999] experimentally compared the linear case
(namely, when the assumption is made that the categories are linearly separable) with the
nonlinear case on a standard benchmark, and obtained slightly better results in the former
case.

As argued by Joachims [1998], SVMs offer two important advantages for TC:

- term selection is often not needed, as SVMs tend to be fairly robust to overfitting and
  can scale up to considerable dimensionalities;
- no human and machine effort in parameter tuning on a validation set is needed, as there is
  a theoretically motivated, default choice of parameter settings, which has also been shown
  to provide the best effectiveness.

Dumais et al. [1998] have applied a novel algorithm for training SVMs which brings about
training speeds comparable to computationally easy learners such as Rocchio.

6.11. Classifier Committees

Classifier committees (a.k.a. ensembles) are based on the idea that, given a task that
requires expert knowledge to perform, $k$ experts may be better than one if their individual
judgments are appropriately combined. In TC, the idea is to apply $k$ different classifiers
$\Phi_1, \ldots, \Phi_k$ to the same task of deciding whether $d_j \in c_i$, and then
combine their outcome appropriately. A classifier committee is then characterized by (i) a
choice of $k$ classifiers, and (ii) a choice of a combination function.

Concerning Issue (i), it is known from the ML literature that, in order to guarantee good
effectiveness, the classifiers
forming the committee should be as independent as possible [Tumer and Ghosh 1996]. The
classifiers may differ for the indexing approach used, or for the inductive method, or both.
Within TC, the avenue which has been explored most is the latter (to our knowledge the only
example of the former is Scott and Matwin [1999]).

Concerning Issue (ii), various rules have been tested. The simplest one is majority voting
(MV), whereby the binary outputs of the $k$ classifiers are pooled together, and the
classification decision that reaches the majority of $\frac{k+1}{2}$ votes is taken ($k$
obviously needs to be an odd number) [Li and Jain 1998; Liere and Tadepalli 1997]. This
method is particularly suited to the case in which the committee includes classifiers
characterized by a binary decision function $CSV_i : D \rightarrow \{T, F\}$. A second rule
is weighted linear combination (WLC), whereby a weighted sum of the $CSV_i$'s produced by
the $k$ classifiers yields the final $CSV_i$. The weights $w_j$ reflect the expected
relative effectiveness of classifiers $\Phi_j$, and are usually optimized on a validation
set [Larkey and Croft 1996]. Another policy is dynamic classifier selection (DCS), whereby
among committee $\{\Phi_1, \ldots, \Phi_k\}$ the classifier $\Phi_t$ most effective on the
$l$ validation examples most similar to $d_j$ is selected, and its judgment adopted by the
committee [Li and Jain 1998]. A still different policy, somehow intermediate between WLC and
DCS, is adaptive classifier combination (ACC), whereby the judgments of all the classifiers
in the committee are summed together, but their individual contribution is weighted by their
effectiveness on the $l$ validation examples most similar to $d_j$ [Li and Jain 1998].

Classifier committees have had mixed results in TC so far. Larkey and Croft [1996] have used
combinations of Rocchio, Naïve Bayes, and k-NN, all together or in pairwise combinations,
using a WLC rule. In their experiments the combination of any two classifiers outperformed
the best individual classifier (k-NN), and the combination of the three classifiers improved
on all three pairwise combinations. These results would seem to give strong support to the
idea that classifier committees can somehow profit from the complementary strengths of their
individual members. However, the small size of the test set used (187 documents) suggests
that more experimentation is needed before conclusions can be drawn.

Li and Jain [1998] have tested a committee formed of (various combinations of) a Naïve Bayes
classifier, an example-based classifier, a decision tree classifier, and a classifier built
by means of their own subspace method; the combination rules they have worked with are MV,
DCS, and ACC. Only in the case of a committee formed by Naïve Bayes and the subspace
classifier combined by means of ACC has the committee outperformed, and by a narrow margin,
the best individual classifier (for every attempted classifier combination ACC gave better
results than MV and DCS). This seems discouraging, especially in light of the fact that the
committee approach is computationally expensive (its cost trivially amounts to the sum of
the costs of the individual classifiers plus the cost incurred for the computation of the
combination rule). Again, it has to be remarked that the small size of their experiment (two
test sets of fewer than 700 documents each were used) does not allow us to draw definitive
conclusions on the approaches adopted.

6.11.1. Boosting. The boosting method [Schapire et al. 1998; Schapire and Singer 2000]
occupies a special place in the classifier committees literature, since the $k$ classifiers
$\Phi_1, \ldots, \Phi_k$ forming the committee are obtained by the same learning method
(here called the weak learner). The key intuition of boosting is that the $k$ classifiers
should be trained not in a conceptually parallel and independent way, as in the committees
described above, but sequentially. In this way, in training classifier $\Phi_i$ one may take
into account how classifiers $\Phi_1, \ldots, \Phi_{i-1}$ perform on the training examples,
and concentrate on getting right those examples on which $\Phi_1, \ldots, \Phi_{i-1}$ have
performed worst.

Specifically, for learning classifier $\Phi_t$ each $\langle d_j, c_i \rangle$ pair is given
an importance weight $h^t_{ij}$ (where $h^1_{ij}$ is set to be equal for
all $\langle d_j, c_i \rangle$ pairs$^{15}$), which represents how hard to get a correct
decision for this pair was for classifiers $\Phi_1, \ldots, \Phi_{t-1}$. These weights are
exploited in learning $\Phi_t$, which will be specially tuned to correctly solve the pairs
with higher weight. Classifier $\Phi_t$ is then applied to the training documents, and as a
result weights $h^t_{ij}$ are updated to $h^{t+1}_{ij}$; in this update operation, pairs
correctly classified by $\Phi_t$ will have their weight decreased, while pairs misclassified
by $\Phi_t$ will have their weight increased. After all the $k$ classifiers have been built,
a weighted linear combination rule is applied to yield the final committee.

15 Schapire et al. [1998] also showed that a simple modification of this policy allows
optimization of the classifier based on utility (see Section 7.1.3).

In the BOOSTEXTER system [Schapire and Singer 2000], two different boosting algorithms are
tested, using a one-level decision tree weak learner. The former algorithm (ADABOOST.MH,
simply called ADABOOST in Schapire et al. [1998]) is explicitly geared toward the
maximization of microaveraged effectiveness, whereas the latter (ADABOOST.MR) is aimed at
minimizing ranking loss (i.e., at getting a correct category ranking for each individual
document). In experiments conducted over three different test collections, Schapire et al.
[1998] have shown ADABOOST to outperform SLEEPING EXPERTS, a classifier that had proven
quite effective in the experiments of Cohen and Singer [1999]. Further experiments by
Schapire and Singer [2000] showed ADABOOST to outperform, aside from SLEEPING EXPERTS, a
Naïve Bayes classifier, a standard (nonenhanced) Rocchio classifier, and Joachims' [1997]
PRTFIDF classifier.

A boosting algorithm based on a committee of classifier subcommittees that improves on the
effectiveness and (especially) the efficiency of ADABOOST.MH was presented in Sebastiani
et al. [2000]. An approach similar to boosting was also employed by Weiss et al. [1999], who
experimented with committees of decision trees each having an average of 16 leaves (and
hence much more complex than the simple decision stumps used in Schapire and Singer [2000]),
eventually combined by using the simple MV rule as a combination rule; similarly to
boosting, a mechanism for emphasising documents that have been misclassified by previous
decision trees is used. Boosting-based approaches have also been employed in Escudero et al.
[2000], Iyer et al. [2000], Kim et al. [2000], Li and Jain [1998], and Myers et al. [2000].

6.12. Other Methods

Although in the previous sections we have tried to give an overview as complete as possible
of the learning approaches proposed in the TC literature, it is hardly possible to be
exhaustive. Some of the learning approaches adopted do not fall squarely under one or the
other class of algorithms, or have remained somehow isolated attempts. Among these, the most
noteworthy are the ones based on Bayesian inference networks [Dumais et al. 1998; Lam et al.
1997; Tzeras and Hartmann 1993], genetic algorithms [Clack et al. 1997; Masand 1994], and
maximum entropy modelling [Manning and Schütze 1999].

7. EVALUATION OF TEXT CLASSIFIERS

As for text search systems, the evaluation of document classifiers is typically conducted
experimentally, rather than analytically. The reason is that, in order to evaluate a system
analytically (e.g., proving that the system is correct and complete), we would need a formal
specification of the problem that the system is trying to solve (e.g., with respect to what
correctness and completeness are defined), and the central notion of TC (namely, that of
membership of a document in a category) is, due to its subjective character, inherently
nonformalizable.

The experimental evaluation of a classifier usually measures its effectiveness (rather than
its efficiency), that is, its ability to take the right classification decisions.

Table II. The Contingency Table for Category $c_i$

    Category $c_i$                  Expert judgments
                                    YES        NO
    Classifier judgments   YES      $TP_i$     $FP_i$
                           NO       $FN_i$     $TN_i$

Table III. The Global Contingency Table

    Category set $C = \{c_1, \ldots, c_{|C|}\}$     Expert judgments
                                    YES                               NO
    Classifier judgments   YES      $TP = \sum_{i=1}^{|C|} TP_i$      $FP = \sum_{i=1}^{|C|} FP_i$
                           NO       $FN = \sum_{i=1}^{|C|} FN_i$      $TN = \sum_{i=1}^{|C|} TN_i$

7.1. Measures of Text Categorization Effectiveness

7.1.1. Precision and Recall. Classification effectiveness is usually measured in terms of
the classic IR notions of precision ($\pi$) and recall ($\rho$), adapted to the case of TC.
Precision wrt $c_i$ ($\pi_i$) is defined as the conditional probability
$P(\breve{\Phi}(d_x, c_i) = T \mid \Phi(d_x, c_i) = T)$, that is, as the probability that if
a random document $d_x$ is classified under $c_i$, this decision is correct. Analogously,
recall wrt $c_i$ ($\rho_i$) is defined as
$P(\Phi(d_x, c_i) = T \mid \breve{\Phi}(d_x, c_i) = T)$, that is, as the probability that,
if a random document $d_x$ ought to be classified under $c_i$, this decision is taken. These
category-relative values may be averaged, in a way to be discussed shortly, to obtain $\pi$
and $\rho$, that is, values global to the entire category set. Borrowing terminology from
logic, $\pi$ may be viewed as the degree of soundness of the classifier wrt $C$, while
$\rho$ may be viewed as its degree of completeness wrt $C$. As defined here, $\pi_i$ and
$\rho_i$ are to be understood as subjective probabilities, that is, as measuring the
expectation of the user that the system will behave correctly when classifying an unseen
document under $c_i$. These probabilities may be estimated in terms of the contingency table
for $c_i$ on a given test set (see Table II). Here, $FP_i$ (false positives wrt $c_i$,
a.k.a. errors of commission) is the number of test documents incorrectly classified under
$c_i$; $TN_i$ (true negatives wrt $c_i$), $TP_i$ (true positives wrt $c_i$), and $FN_i$
(false negatives wrt $c_i$, a.k.a. errors of omission) are defined accordingly. Estimates
(indicated by carets) of precision wrt $c_i$ and recall wrt $c_i$ may thus be obtained as

\[
\hat{\pi}_i = \frac{TP_i}{TP_i + FP_i}, \qquad \hat{\rho}_i = \frac{TP_i}{TP_i + FN_i}.
\]

For obtaining estimates of $\pi$ and $\rho$, two different methods may be adopted:

- microaveraging: $\pi$ and $\rho$ are obtained by summing over all individual decisions:

\[
\hat{\pi}^{\mu} = \frac{TP}{TP + FP}
  = \frac{\sum_{i=1}^{|C|} TP_i}{\sum_{i=1}^{|C|} (TP_i + FP_i)}, \qquad
\hat{\rho}^{\mu} = \frac{TP}{TP + FN}
  = \frac{\sum_{i=1}^{|C|} TP_i}{\sum_{i=1}^{|C|} (TP_i + FN_i)},
\]

  where $\mu$ indicates microaveraging. The global contingency table (Table III) is thus
  obtained by summing over category-specific contingency tables;
- macroaveraging: precision and recall are first evaluated locally for each category, and
  then globally by averaging over the results of the different categories:

\[
\hat{\pi}^{M} = \frac{\sum_{i=1}^{|C|} \hat{\pi}_i}{|C|}, \qquad
\hat{\rho}^{M} = \frac{\sum_{i=1}^{|C|} \hat{\rho}_i}{|C|},
\]

  where $M$ indicates macroaveraging.

These two methods may give quite different results, especially if the different categories
have very different generality. For instance, the ability of a classifier to behave well
also on categories with low generality (i.e., categories with few positive training
instances) will be emphasized by macroaveraging and much less so by microaveraging. Whether
one or the other should be used obviously depends on the application requirements. From now
on, we will assume that microaveraging is used; everything we will say in the rest of
Section 7 may be adapted to the case of macroaveraging in the obvious way.

7.1.2. Other Measures of Effectiveness. Measures alternative to $\pi$ and $\rho$ and
commonly used in the ML literature, such as accuracy (estimated as

$A = \frac{TP + TN}{TP + TN + FP + FN}$) and error (estimated as $E = \frac{FP + FN}{TP + TN + FP + FN} = 1 - A$), are not widely used in TC. The reason is that, as Yang [1999] pointed out, the large value that their denominator typically has in TC makes them much more insensitive to variations in the number of correct decisions ($TP + TN$) than $\pi$ and $\rho$. Besides, if $A$ is the adopted evaluation measure, in the frequent case of a very low average generality the trivial rejector (i.e., the classifier $\Phi$ such that $\Phi(d_j, c_i) = F$ for all $d_j$ and $c_i$) tends to outperform all nontrivial classifiers (see also Cohen [1995a], Section 2.3). If $A$ is adopted, parameter tuning on a validation set may thus result in parameter choices that make the classifier behave very much like the trivial rejector.

A nonstandard effectiveness measure was proposed by Sable and Hatzivassiloglou [2000, Section 7], who suggested basing $\pi$ and $\rho$ not on "absolute" values of success and failure (i.e., 1 if $\Phi(d_j, c_i) = \breve{\Phi}(d_j, c_i)$ and 0 if $\Phi(d_j, c_i) \neq \breve{\Phi}(d_j, c_i)$), but on values of relative success (i.e., $CSV_i(d_j)$ if $\breve{\Phi}(d_j, c_i) = T$ and $1 - CSV_i(d_j)$ if $\breve{\Phi}(d_j, c_i) = F$). This means that for a correct (respectively wrong) decision the classifier is rewarded (respectively penalized) proportionally to its confidence in the decision. This proposed measure does not reward the choice of a good thresholding policy, and is thus unfit for autonomous ("hard") classification systems. However, it might be appropriate for interactive ("ranking") classifiers of the type used in Larkey [1999], where the confidence that the classifier has in its own decision influences category ranking and, as a consequence, the overall usefulness of the system.

7.1.3. Measures Alternative to Effectiveness.
In general, criteria different from effectiveness are seldom used in classifier evaluation. For instance, efficiency, although very important for applicative purposes, is seldom used as the sole yardstick, due to the volatility of the parameters on which the evaluation rests. However, efficiency may be useful for choosing among classifiers with similar effectiveness. An interesting evaluation has been carried out by Dumais et al. [1998], who have compared five different learning methods along three different dimensions, namely, effectiveness, training efficiency (i.e., the average time it takes to build a classifier for category $c_i$ from a training set $Tr$), and classification efficiency (i.e., the average time it takes to classify a new document $d_j$ under category $c_i$).

Table IV. The Utility Matrix

Category set                        Expert judgments
C = {c_1, ..., c_|C|}               YES        NO
Classifier judgments    YES         u_TP       u_FP
                        NO          u_FN       u_TN

An important alternative to effectiveness is utility, a class of measures from decision theory that extend effectiveness by economic criteria such as gain or loss. Utility is based on a utility matrix such as that of Table IV, where the numeric values $u_{TP}$, $u_{FP}$, $u_{FN}$, and $u_{TN}$ represent the gain brought about by a true positive, false positive, false negative, and true negative, respectively; both $u_{TP}$ and $u_{TN}$ are greater than both $u_{FP}$ and $u_{FN}$. Standard effectiveness is a special case of utility, i.e., the one in which $u_{TP} = u_{TN} > u_{FP} = u_{FN}$. Less trivial cases are those in which $u_{TP} \neq u_{TN}$ and/or $u_{FP} \neq u_{FN}$; this is appropriate, for example, in spam filtering, where failing to discard a piece of junk mail (FP) is a less serious mistake than discarding a legitimate message (FN) [Androutsopoulos et al. 2000]. If the classifier outputs probability estimates of the membership of $d_j$ in $c_i$, then decision theory provides analytical methods to determine thresholds $\tau_i$, thus avoiding the need to determine them experimentally (as discussed in Section 6.1). Specifically, as Lewis [1995a] reminds us, the expected value of utility is maximized when

\[ \tau_i = \frac{u_{FP} - u_{TN}}{(u_{FN} - u_{TP}) + (u_{FP} - u_{TN})}, \]

which, in the case of standard effectiveness, is equal to $\frac{1}{2}$.


Table V. Trivial Cases in TC

                                          Precision        Recall           C-precision      C-recall
                                          TP/(TP+FP)       TP/(TP+FN)       TN/(FP+TN)       TN/(TN+FN)

Trivial rejector          TP = FP = 0     Undefined        0/FN = 0         TN/TN = 1        TN/(TN+FN)
Trivial acceptor          FN = TN = 0     TP/(TP+FP)       TP/TP = 1        0/FP = 0         Undefined
Trivial "Yes" collection  FP = TN = 0     TP/TP = 1        TP/(TP+FN)       Undefined        0/FN = 0
Trivial "No" collection   TP = FN = 0     0/FP = 0         Undefined        TN/(FP+TN)       TN/TN = 1
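The trivial cases of Table V can be reproduced with a few lines of code. This is an illustrative sketch only: the helper `contingency_measures` and the 1,000-document test set with 5% generality are assumptions, not data from the survey. Measures with a zero denominator are reported as `None`, standing for "undefined".

```python
def contingency_measures(tp, fp, fn, tn):
    """Precision, recall, C-precision, C-recall, and accuracy from one contingency table.

    A measure whose denominator is zero is returned as None ("undefined").
    """
    def ratio(numerator, denominator):
        return numerator / denominator if denominator else None

    return {
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
        "c_precision": ratio(tn, fp + tn),
        "c_recall": ratio(tn, tn + fn),
        "accuracy": ratio(tp + tn, tp + fp + fn + tn),
    }


# Hypothetical test set: 1,000 documents, 50 of which belong to the category.
rejector = contingency_measures(tp=0, fp=0, fn=50, tn=950)   # always answers NO
acceptor = contingency_measures(tp=50, fp=950, fn=0, tn=0)   # always answers YES
```

On this hypothetical collection the trivial rejector has undefined precision and zero recall yet reaches an accuracy of 0.95, while the trivial acceptor has recall 1 and a precision equal to the generality of the category (0.05).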

The use of utility in TC is discussed in detail by Lewis [1995a]. Other works where utility is employed are Amati and Crestani [1999], Cohen and Singer [1999], Hull et al. [1996], Lewis and Catlett [1994], and Schapire et al. [1998]. Utility has become popular within the text filtering community, and the TREC "filtering track" evaluations have been using it for a while [Lewis 1995c]. The values of the utility matrix are extremely application-dependent. This means that if utility is used instead of "pure" effectiveness, there is a further element of difficulty in the cross-comparison of classification systems (see Section 7.3), since for two classifiers to be experimentally comparable the two utility matrices must also be the same.

Other effectiveness measures different from the ones discussed here have occasionally been used in the literature; these include adjacent score [Larkey 1998], coverage [Schapire and Singer 2000], one-error [Schapire and Singer 2000], Pearson product-moment correlation [Larkey 1998], recall at n [Larkey and Croft 1996], top candidate [Larkey and Croft 1996], and top n [Larkey and Croft 1996]. We will not attempt to discuss them in detail. However, their use shows that, although the TC community is making consistent efforts at standardizing experimentation protocols, we are still far from universal agreement on evaluation issues and, as a consequence, from understanding precisely the relative merits of the various methods.

7.1.4. Combined Effectiveness Measures.
Neither precision nor recall makes sense in isolation from the other. In fact the classifier $\Phi$ such that $\Phi(d_j, c_i) = T$ for all $d_j$ and $c_i$ (the trivial acceptor) has $\rho = 1$. When the $CSV_i$ function has values in $[0, 1]$, one only needs to set every threshold $\tau_i$ to 0 to obtain the trivial acceptor. In this case, $\pi$ would usually be very low (more precisely, equal to the average test set generality $\frac{\sum_{i=1}^{|C|} g_{Te}(c_i)}{|C|}$).16 Conversely, it is well known from everyday IR practice that higher levels of $\pi$ may be obtained at the price of low values of $\rho$.

16 From this, one might be tempted to infer, by symmetry, that the trivial rejector always has $\pi = 1$. This is false, as $\pi$ is undefined (the denominator is zero) for the trivial rejector (see Table V). In fact, it is clear from its definition ($\pi = \frac{TP}{TP + FP}$) that $\pi$ depends only on how the positives ($TP + FP$) are split between the true positives $TP$ and the false positives $FP$, and does not depend at all on the cardinality of the positives. There is a breakup of "symmetry" between $\pi$ and $\rho$ here because, from the point of view of classifier judgment (positives vs. negatives; this is the dichotomy of interest in trivial acceptor vs. trivial rejector), the "symmetric" of $\rho$ ($\frac{TP}{TP + FN}$) is not $\pi$ ($\frac{TP}{TP + FP}$) but C-precision ($\pi^c = \frac{TN}{FP + TN}$), the "contrapositive" of $\pi$. In fact, while $\rho = 1$ and $\pi^c = 0$ for the trivial acceptor, $\pi^c = 1$ and $\rho = 0$ for the trivial rejector.

In practice, by tuning $\tau_i$ a function $CSV_i : D \rightarrow \{T, F\}$ is tuned to be, in the words of Riloff and Lehnert [1994], more liberal (i.e., improving $\rho_i$ to the detriment of $\pi_i$) or more conservative (improving $\pi_i$ to


the detriment of $\rho_i$).17 A classifier should thus be evaluated by means of a measure which combines $\pi$ and $\rho$.18 Various such measures have been proposed, among which the most frequent are:

(1) Eleven-point average precision: threshold $\tau_i$ is repeatedly tuned so as to allow $\rho_i$ to take up values of 0.0, .1, . . . , .9, 1.0; $\pi_i$ is computed for these 11 different values of $\tau_i$, and averaged over the 11 resulting values. This is analogous to the standard evaluation methodology for ranked IR systems, and may be used
    (a) with categories in place of IR queries. This is most frequently used for document-ranking classifiers (see Schutze et al. [1995]; Yang [1994]; Yang [1999]; Yang and Pedersen [1997]);
    (b) with test documents in place of IR queries and categories in place of documents. This is most frequently used for category-ranking classifiers (see Lam et al. [1999]; Larkey and Croft [1996]; Schapire and Singer [2000]; Wiener et al. [1995]). In this case, if macroaveraging is used, it needs to be redefined on a per-document, rather than per-category, basis.
    This measure does not make sense for binary-valued $CSV_i$ functions, since in this case $\rho_i$ may not be varied at will.

(2) The breakeven point, that is, the value at which $\pi$ equals $\rho$ (e.g., Apte et al. [1994]; Cohen and Singer [1999]; Dagan et al. [1997]; Joachims [1998]; Joachims [1999]; Lewis [1992a]; Lewis and Ringuette [1994]; Moulinier and Ganascia [1996]; Ng et al. [1997]; Yang [1999]). This is obtained by a process analogous to the one used for 11-point average precision: a plot of $\pi$ as a function of $\rho$ is computed by repeatedly varying the thresholds $\tau_i$; breakeven is the value of $\rho$ (or $\pi$) for which the plot intersects the $\rho = \pi$ line. This idea relies on the fact that, by decreasing the $\tau_i$'s from 1 to 0, $\rho$ always increases monotonically from 0 to 1 and $\pi$ usually decreases monotonically from a value near 1 to $\frac{1}{|C|} \sum_{i=1}^{|C|} g_{Te}(c_i)$. If for no values of the $\tau_i$'s $\pi$ and $\rho$ are exactly equal, the $\tau_i$'s are set to the value for which $\pi$ and $\rho$ are closest, and an interpolated breakeven is computed as the average of the values of $\pi$ and $\rho$.19

(3) The $F_\beta$ function [van Rijsbergen 1979, Chapter 7], for some $0 \leq \beta \leq +\infty$ (e.g., Cohen [1995a]; Cohen and Singer [1999]; Lewis and Gale [1994]; Lewis [1995a]; Moulinier et al. [1996]; Ruiz and Srinivasan [1999]), where

\[ F_\beta = \frac{(\beta^2 + 1)\,\pi\rho}{\beta^2 \pi + \rho}. \]

    Here $\beta$ may be seen as the relative degree of importance attributed to $\pi$ and $\rho$. If $\beta = 0$ then $F_\beta$ coincides with $\pi$, whereas if $\beta = +\infty$ then $F_\beta$ coincides with $\rho$. Usually, a value $\beta = 1$ is used, which attributes equal importance to $\pi$ and $\rho$. As shown in Moulinier et al. [1996] and Yang [1999], the breakeven of a classifier $\Phi$ is always less than or equal to its $F_1$ value.

17 While $\rho_i$ can always be increased at will by lowering $\tau_i$, usually at the cost of decreasing $\pi_i$, $\pi_i$ can usually be increased at will by raising $\tau_i$, always at the cost of decreasing $\rho_i$. This kind of tuning is only possible for $CSV_i$ functions with values in $[0, 1]$; for binary-valued $CSV_i$ functions tuning is not always possible, or is anyway more difficult (see Weiss et al. [1999], page 66).

18 An exception is single-label TC, in which $\pi$ and $\rho$ are not independent of each other: if a document $d_j$ has been classified under a wrong category $c_s$ (thus decreasing $\pi_s$), this also means that it has not been classified under the right category $c_t$ (thus decreasing $\rho_t$). In this case either $\pi$ or $\rho$ can be used as a measure of effectiveness.

19 Breakeven, first proposed by Lewis [1992a, 1992b], has been recently criticized. Lewis himself (see his message of 11 Sep 1997 10:49:01 to the DDLBETA text categorization mailing list, quoted with permission of the author) has pointed out that breakeven is not a good effectiveness measure, since (i) there may be no parameter setting that yields the breakeven; in this case the final breakeven value, obtained by interpolation, is artificial; (ii) to have $\rho$ equal $\pi$ is not necessarily desirable, and it is not clear that a system that achieves high breakeven can be tuned to score high on other effectiveness measures. Yang [1999] also noted that when for no value of the parameters $\pi$ and $\rho$ are close enough, interpolated breakeven may not be a reliable indicator of effectiveness.
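As an illustration of the $F_\beta$ function and of how the choice of averaging affects the figures fed into it, consider the following minimal sketch. The function names and the two per-category contingency tables (one frequent category, one rare category) are hypothetical; microaveraging is taken here to mean summing the per-category tables before computing precision and recall, and macroaveraging to mean averaging the per-category values.

```python
import math


def f_beta(pi, rho, beta=1.0):
    """F_beta = (beta^2 + 1) * pi * rho / (beta^2 * pi + rho)."""
    if math.isinf(beta):
        return rho  # limiting case: F_beta coincides with recall
    denominator = beta ** 2 * pi + rho
    if denominator == 0:
        return 0.0
    return (beta ** 2 + 1) * pi * rho / denominator


def micro_average(tables):
    """Precision and recall computed on the summed contingency tables."""
    tp = sum(t["tp"] for t in tables)
    fp = sum(t["fp"] for t in tables)
    fn = sum(t["fn"] for t in tables)
    return tp / (tp + fp), tp / (tp + fn)


def macro_average(tables):
    """Per-category precision and recall, averaged over the categories."""
    precisions = [t["tp"] / (t["tp"] + t["fp"]) for t in tables]
    recalls = [t["tp"] / (t["tp"] + t["fn"]) for t in tables]
    return sum(precisions) / len(tables), sum(recalls) / len(tables)


# Hypothetical counts: a frequent category handled well, a rare one handled badly.
tables = [
    {"tp": 90, "fp": 10, "fn": 10},
    {"tp": 1, "fp": 1, "fn": 9},
]
micro_pi, micro_rho = micro_average(tables)
macro_pi, macro_rho = macro_average(tables)
micro_f1 = f_beta(micro_pi, micro_rho, beta=1.0)
```

On these hypothetical counts the microaveraged values stay high because the frequent category dominates the sums, whereas the macroaveraged values drop to 0.7 (precision) and 0.5 (recall), reflecting the poor behavior on the rare category.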


Once an effectiveness measure is chosen, a classifier can be tuned (e.g., thresholds and other parameters can be set) so that the resulting effectiveness is the best achievable by that classifier. Tuning a parameter $p$ (be it a threshold or other) is normally done experimentally. This means performing repeated experiments on the validation set with the values of the other parameters $p_k$ fixed (at a default value, in the case of a yet-to-be-tuned parameter $p_k$, or at the chosen value, if the parameter $p_k$ has already been tuned) and with different values for parameter $p$. The value that has yielded the best effectiveness is chosen for $p$.

7.2. Benchmarks for Text Categorization

Standard benchmark collections that can be used as initial corpora for TC are publicly available for experimental purposes. The most widely used is the Reuters collection, consisting of a set of newswire stories classified under categories related to economics. The Reuters collection accounts for most of the experimental work in TC so far. Unfortunately, this does not always translate into reliable comparative results, in the sense that many of these experiments have been carried out in subtly different conditions.

In general, different sets of experiments may be used for cross-classifier comparison only if the experiments have been performed

(1) on exactly the same collection (i.e., same documents and same categories);
(2) with the same "split" between training set and test set;
(3) with the same evaluation measure and, whenever this measure depends on some parameters (e.g., the utility matrix chosen), with the same parameter values.

Unfortunately, a lot of experimentation, both on Reuters and on other collections, has not been performed with these caveats in mind: by testing three different classifiers on five popular versions of Reuters, Yang [1999] has shown that a lack of compliance with these three conditions may make the experimental results hardly comparable with each other. Table VI lists the results of all experiments known to us performed on five major versions of the Reuters benchmark: Reuters-22173 "ModLewis" (column #1), Reuters-22173 "ModApte" (column #2), Reuters-22173 "ModWiener" (column #3), Reuters-21578 "ModApte" (column #4), and Reuters-21578[10] "ModApte" (column #5).20 Only experiments that have computed either a breakeven or $F_1$ have been listed, since other less popular effectiveness measures do not readily compare with these.

20 The Reuters-21578 collection may be freely downloaded for experimentation purposes from http://www.research.att.com/~lewis/reuters21578.html. A new corpus, called Reuters Corpus Volume 1 and consisting of roughly 800,000 documents, has recently been made available by Reuters for TC experiments (see http://about.reuters.com/researchandstandards/corpus/). This will likely replace Reuters-21578 as the "standard" Reuters benchmark for TC.

Note that only results belonging to the same column are directly comparable. In particular, Yang [1999] showed that experiments carried out on Reuters-22173 "ModLewis" (column #1) are not directly comparable with those using the other three versions, since the former strangely includes a significant percentage (58%) of "unlabeled" test documents which, being negative examples of all categories, tend to depress effectiveness. Also, experiments performed on Reuters-21578[10] "ModApte" (column #5) are not comparable with the others, since this collection is the restriction of Reuters-21578 "ModApte" to the 10 categories with the highest generality, and is thus an obviously "easier" collection.

Other test collections that have been frequently used are

- the OHSUMED collection, set up by Hersh et al. [1994] and used by Joachims [1998], Lam and Ho [1998], Lam et al. [1999], Lewis et al. [1996], Ruiz and Srinivasan [1999], and Yang


Table VI. Comparative Results Among Different Classifiers Obtained on Five Different Versions of Reuters.
(Unless otherwise noted, entries indicate the microaveraged breakeven point; within parentheses, "M" indicates macroaveraging and "F1" indicates use of the F1 measure; boldface indicates the best performer on the collection.)
#1 #2 #3 #4 #5
# of documents 21,450 14,347 13,272 12,902 12,902
# of training documents 14,704 10,667 9,610 9,603 9,603
# of test documents 6,746 3,680 3,662 3,299 3,299
# of categories 135 93 92 90 10
System Type Results reported by
WORD (non-learning) Yang [1999] .150 .310 .290
probabilistic [Dumais et al. 1998] .752 .815
probabilistic [Joachims 1998] .720
probabilistic [Lam et al. 1997] .443 (MF1 )
PROPBAYES probabilistic [Lewis 1992a] .650
BIM probabilistic [Li and Yamanishi 1999] .747
probabilistic [Li and Yamanishi 1999] .773
NB probabilistic [Yang and Liu 1999] .795
decision trees [Dumais et al. 1998] .884
C4.5 decision trees [Joachims 1998] .794
IND decision trees [Lewis and Ringuette 1994] .670
SWAP-1 decision rules [Apte et al. 1994] .805
RIPPER decision rules [Cohen and Singer 1999] .683 .811 .820
SLEEPINGEXPERTS decision rules [Cohen and Singer 1999] .753 .759 .827
DL-ESC decision rules [Li and Yamanishi 1999] .820
CHARADE decision rules [Moulinier and Ganascia 1996] .738
CHARADE decision rules [Moulinier et al. 1996] .783 (F1 )
LLSF regression [Yang 1999] .855 .810
LLSF regression [Yang and Liu 1999] .849
BALANCEDWINNOW on-line linear [Dagan et al. 1997] .747 (M) .833 (M)
WIDROW-HOFF on-line linear [Lam and Ho 1998] .822
ROCCHIO batch linear [Cohen and Singer 1999] .660 .748 .776
FINDSIM batch linear [Dumais et al. 1998] .617 .646
ROCCHIO batch linear [Joachims 1998] .799
ROCCHIO batch linear [Lam and Ho 1998] .781
ROCCHIO batch linear [Li and Yamanishi 1999] .625
CLASSI neural network [Ng et al. 1997] .802
NNET neural network [Yang and Liu 1999] .838
neural network [Wiener et al. 1995] .820
GIS-W example-based [Lam and Ho 1998] .860
k-NN example-based [Joachims 1998] .823
k-NN example-based [Lam and Ho 1998] .820
k-NN example-based [Yang 1999] .690 .852 .820
k-NN example-based [Yang and Liu 1999] .856
SVM [Dumais et al. 1998] .870 .920
SVMLIGHT SVM [Joachims 1998] .864
SVMLIGHT SVM [Li and Yamanishi 1999] .841
SVMLIGHT SVM [Yang and Liu 1999] .859
ADABOOST.MH committee [Schapire and Singer 2000] .860
committee [Weiss et al. 1999] .878
Bayesian net [Dumais et al. 1998] .800 .850
Bayesian net [Lam et al. 1997] .542 (MF1 )

and Pedersen [1997].21 The documents are titles or title-plus-abstracts from medical journals (OHSUMED is actually a subset of the Medline document base); the categories are the "postable terms" of the MESH thesaurus.

21 The OHSUMED collection may be freely downloaded for experimentation purposes from ftp://medir.ohsu.edu/pub/ohsumed.

- the 20 Newsgroups collection, set up by Lang [1995] and used by Baker and McCallum [1998], Joachims [1997], McCallum and Nigam [1998], McCallum et al. [1998], Nigam et al.


[2000], and Schapire and Singer [2000]. The documents are messages posted to Usenet newsgroups, and the categories are the newsgroups themselves.

- the AP collection, used by Cohen [1995a, 1995b], Cohen and Singer [1999], Lewis and Catlett [1994], Lewis and Gale [1994], Lewis et al. [1996], Schapire and Singer [2000], and Schapire et al. [1998].

We will not cover the experiments performed on these collections for the same reasons as those illustrated in footnote 20, that is, because in no case have a significant enough number of authors used the same collection in the same experimental conditions, thus making comparisons difficult.

7.3. Which Text Classifier Is Best?

The published experimental results, and especially those listed in Table VI, allow us to attempt some considerations on the comparative performance of the TC methods discussed. However, we have to bear in mind that comparisons are reliable only when based on experiments performed by the same author under carefully controlled conditions. They are instead more problematic when they involve different experiments performed by different authors. In this case various "background conditions," often extraneous to the learning algorithm itself, may influence the results. These may include, among others, different choices in preprocessing (stemming, etc.), indexing, dimensionality reduction, classifier parameter values, etc., but also different standards of compliance with safe scientific practice (such as tuning parameters on the test set rather than on a separate validation set), which often are not discussed in the published papers.

Two different methods may thus be applied for comparing classifiers [Yang 1999]:

- direct comparison: classifiers $\Phi'$ and $\Phi''$ may be compared when they have been tested on the same collection $\Omega$, usually by the same researchers and with the same background conditions. This is the more reliable method.
- indirect comparison: classifiers $\Phi'$ and $\Phi''$ may be compared when
  (1) they have been tested on collections $\Omega'$ and $\Omega''$, respectively, typically by different researchers and hence with possibly different background conditions;
  (2) one or more "baseline" classifiers $\bar{\Phi}_1, \ldots, \bar{\Phi}_m$ have been tested on both $\Omega'$ and $\Omega''$ by the direct comparison method.
  Test 2 gives an indication on the relative "hardness" of $\Omega'$ and $\Omega''$; using this and the results from Test 1, we may obtain an indication on the relative effectiveness of $\Phi'$ and $\Phi''$. For the reasons discussed above, this method is less reliable.

A number of interesting conclusions can be drawn from Table VI by using these two methods. Concerning the relative "hardness" of the five collections, if by $\Omega' > \Omega''$ we indicate that $\Omega'$ is a harder collection than $\Omega''$, there seems to be enough evidence that Reuters-22173 "ModLewis" $\gg$ Reuters-22173 "ModWiener" $>$ Reuters-22173 "ModApte" $\approx$ Reuters-21578 "ModApte" $>$ Reuters-21578[10] "ModApte." These facts are unsurprising; in particular, the first and the last inequalities are a direct consequence of the peculiar characteristics of Reuters-22173 "ModLewis" and Reuters-21578[10] "ModApte" discussed in Section 7.2.

Concerning the relative performance of the classifiers, remembering the considerations above we may attempt a few conclusions:

- Boosting-based classifier committees, support vector machines, example-based methods, and regression methods deliver top-notch performance. There seems to be no sufficient evidence to decidedly opt for either method; efficiency considerations or application-dependent issues might play a role in breaking the tie.
- Neural networks and on-line linear classifiers work very well, although slightly


worse than the previously mentioned methods.
- Batch linear classifiers (Rocchio) and probabilistic Naive Bayes classifiers look the worst of the learning-based classifiers. For Rocchio, these results confirm earlier results by Schutze et al. [1995], who had found three classifiers based on linear discriminant analysis, linear regression, and neural networks to perform about 15% better than Rocchio. However, recent results by Schapire et al. [1998] ranked Rocchio among the best performers once near-positives are used in training.
- The data in Table VI is hardly sufficient to say anything about decision trees. However, the work by Dumais et al. [1998], in which a decision tree classifier was shown to perform nearly as well as their top performing system (an SVM classifier), will probably renew the interest in decision trees, an interest that had dwindled after the unimpressive results reported in earlier literature [Cohen and Singer 1999; Joachims 1998; Lewis and Catlett 1994; Lewis and Ringuette 1994].
- By far the lowest performance is displayed by WORD, a classifier implemented by Yang [1999] and not including any learning component.22

22 WORD is based on the comparison between documents and category names, each treated as a vector of weighted terms in the vector space model. WORD was implemented by Yang with the only purpose of determining the difference in effectiveness that adding a learning component to a classifier brings about. WORD is actually called STR in [Yang 1994; Yang and Chute 1994]. Another no-learning classifier was proposed in Wong et al. [1996].

Concerning WORD and no-learning classifiers, for completeness we should recall that one of the highest effectiveness values reported in the literature for the Reuters collection (a .90 breakeven) belongs to CONSTRUE, a manually constructed classifier. However, this classifier has never been tested on the standard variants of Reuters mentioned in Table VI, and it is not clear [Yang 1999] whether the (small) test set of Reuters-22173 "ModHayes" on which the .90 breakeven value was obtained was chosen randomly, as safe scientific practice would demand. Therefore, the fact that this figure is indicative of the performance of CONSTRUE, and of the manual approach it represents, has been convincingly questioned [Yang 1999].

It is important to bear in mind that the considerations above are not absolute statements (if there may be any) on the comparative effectiveness of these TC methods. One of the reasons is that a particular applicative context may exhibit very different characteristics from the ones to be found in Reuters, and different classifiers may respond differently to these characteristics. An experimental study by Joachims [1998] involving support vector machines, k-NN, decision trees, Rocchio, and Naive Bayes showed all these classifiers to have similar effectiveness on categories with $\geq 300$ positive training examples each. The fact that this experiment involved the methods which have scored best (support vector machines, k-NN) and worst (Rocchio and Naive Bayes) according to Table VI shows that applicative contexts different from Reuters may well invalidate conclusions drawn on the latter.

Finally, a note about the worth of statistical significance testing. Few authors have gone to the trouble of validating their results by means of such tests. These tests are useful for verifying how strongly the experimental results support the claim that a given system $\Phi'$ is better than another system $\Phi''$, or for verifying how much a difference in the experimental setup affects the measured effectiveness of a system $\Phi$. Hull [1994] and Schutze et al. [1995] have been among the first to work in this direction, validating their results by means of the ANOVA test and the Friedman test; the former is aimed at determining the significance of the difference in effectiveness between two methods in terms of the ratio between this difference and the effectiveness variability across categories, while the latter conducts a similar test by using instead the rank positions of each method within a category. Yang and Liu [1999] defined a full suite of significance


tests, some of which apply to microaveraged and some to macroaveraged effectiveness. They applied them systematically to the comparison between five different classifiers, and were thus able to infer fine-grained conclusions about their relative effectiveness. For other examples of significance testing in TC, see Cohen [1995a, 1995b], Cohen and Hirsh [1998], Joachims [1997], Koller and Sahami [1997], Lewis et al. [1996], and Wiener et al. [1995].

8. CONCLUSION

Automated TC is now a major research area within the information systems discipline, thanks to a number of factors:

- Its domains of application are numerous and important, and given the proliferation of documents in digital form they are bound to increase dramatically in both number and importance.
- It is indispensable in many applications in which the sheer number of the documents to be classified and the short response time required by the application make the manual alternative implausible.
- It can improve the productivity of human classifiers in applications in which no classification decision can be taken without a final human judgment [Larkey and Croft 1996], by providing tools that quickly "suggest" plausible decisions.
- It has reached effectiveness levels comparable to those of trained professionals. The effectiveness of manual TC is not 100% anyway [Cleverdon 1984] and, more importantly, it is unlikely to be improved substantially by the progress of research. The levels of effectiveness of automated TC are instead growing at a steady pace, and even if they will likely reach a plateau well below the 100% level, this plateau will probably be higher than the effectiveness levels of manual TC.

One of the reasons why from the early '90s the effectiveness of text classifiers has dramatically improved is the arrival in the TC arena of ML methods that are backed by strong theoretical motivations. Examples of these are multiplicative weight updating (e.g., the WINNOW family, WIDROW-HOFF, etc.), adaptive resampling (e.g., boosting), and support vector machines, which provide a sharp contrast with relatively unsophisticated and weak methods such as Rocchio. In TC, ML researchers have found a challenging application, since datasets consisting of hundreds of thousands of documents and characterized by tens of thousands of terms are widely available. This means that TC is a good benchmark for checking whether a given learning technique can scale up to substantial sizes. In turn, this probably means that the active involvement of the ML community in TC is bound to grow.

The success story of automated TC is also going to encourage an extension of its methods and techniques to neighboring fields of application. Techniques typical of automated TC have already been extended successfully to the categorization of documents expressed in slightly different media; for instance:

- very noisy text resulting from optical character recognition [Ittner et al. 1995; Junker and Hoch 1998]. In their experiments Ittner et al. [1995] have found that, by employing noisy texts also in the training phase (i.e., texts affected by the same source of noise that is also at work in the test documents), effectiveness levels comparable to those obtainable in the case of standard text can be achieved.
- speech transcripts [Myers et al. 2000; Schapire and Singer 2000]. For instance, Schapire and Singer [2000] classified answers given to a phone operator's request "How may I help you?" so as to be able to route the call to a specialized operator according to call type.

Concerning other more radically different media, the situation is not as bright (however, see Lim [1999] for an interesting attempt at image categorization based


on a textual metaphor). The reason for this is that capturing real semantic content of nontextual media by automatic indexing is still an open problem. While there are systems that attempt to detect content, for example, in images by recognizing shapes, color distributions, and texture, the general problem of image semantics is still unsolved. The main reason is that natural language, the language of the text medium, admits far fewer variations than the languages employed by the other media. For instance, while the concept of a house can be triggered by relatively few natural language expressions such as house, houses, home, housing, inhabiting, etc., it can be triggered by far more images: the images of all the different houses that exist, of all possible colors and shapes, viewed from all possible perspectives, from all possible distances, etc. If we had solved the multimedia indexing problem in a satisfactory way, the general methodology that we have discussed in this paper for text would also apply to automated multimedia categorization, and there are reasons to believe that the effectiveness levels could be as high. This only adds to the common sentiment that more research in automated content-based indexing for multimedia documents is needed.

ACKNOWLEDGMENTS

This paper owes a lot to the suggestions and constructive criticism of Norbert Fuhr and David Lewis. Thanks also to Umberto Straccia for comments on an earlier draft, to Evgeniy Gabrilovich, Daniela Giorgetti, and Alessandro Moschitti for spotting mistakes in an earlier draft, and to Alessandro Sperduti for many fruitful discussions.

REFERENCES

AMATI, G. AND CRESTANI, F. 1999. Probabilistic learning for selective dissemination of information. Inform. Process. Man. 35, 5, 633–654.
ANDROUTSOPOULOS, I., KOUTSIAS, J., CHANDRINOS, K. V., AND SPYROPOULOS, C. D. 2000. An experimental comparison of naive Bayesian and keyword-based anti-spam filtering with personal e-mail messages. In Proceedings of SIGIR-00, 23rd ACM International Conference on Research and Development in Information Retrieval (Athens, Greece, 2000), 160–167.
APTÉ, C., DAMERAU, F. J., AND WEISS, S. M. 1994. Automated learning of decision rules for text categorization. ACM Trans. Inform. Syst. 12, 3, 233–251.
ATTARDI, G., DI MARCO, S., AND SALVI, D. 1998. Categorization by context. J. Univers. Comput. Sci. 4, 9, 719–736.
BAKER, L. D. AND MCCALLUM, A. K. 1998. Distributional clustering of words for text classification. In Proceedings of SIGIR-98, 21st ACM International Conference on Research and Development in Information Retrieval (Melbourne, Australia, 1998), 96–103.
BELKIN, N. J. AND CROFT, W. B. 1992. Information filtering and information retrieval: two sides of the same coin? Commun. ACM 35, 12, 29–38.
BIEBRICHER, P., FUHR, N., KNORZ, G., LUSTIG, G., AND SCHWANTNER, M. 1988. The automatic indexing system AIR/PHYS. From research to application. In Proceedings of SIGIR-88, 11th ACM International Conference on Research and Development in Information Retrieval (Grenoble, France, 1988), 333–342. Also reprinted in Sparck Jones and Willett [1997], pp. 513–517.
BORKO, H. AND BERNICK, M. 1963. Automatic document classification. J. Assoc. Comput. Mach. 10, 2, 151–161.
CAROPRESO, M. F., MATWIN, S., AND SEBASTIANI, F. 2001. A learner-independent evaluation of the usefulness of statistical phrases for automated text categorization. In Text Databases and Document Management: Theory and Practice, A. G. Chin, ed. Idea Group Publishing, Hershey, PA, 78–102.
CAVNAR, W. B. AND TRENKLE, J. M. 1994. N-gram-based text categorization. In Proceedings of SDAIR-94, 3rd Annual Symposium on Document Analysis and Information Retrieval (Las Vegas, NV, 1994), 161–175.
CHAKRABARTI, S., DOM, B. E., AGRAWAL, R., AND RAGHAVAN, P. 1998a. Scalable feature selection, classification and signature generation for organizing large text databases into hierarchical topic taxonomies. J. Very Large Data Bases 7, 3, 163–178.
CHAKRABARTI, S., DOM, B. E., AND INDYK, P. 1998b. Enhanced hypertext categorization using hyperlinks. In Proceedings of SIGMOD-98, ACM International Conference on Management of Data (Seattle, WA, 1998), 307–318.
CLACK, C., FARRINGDON, J., LIDWELL, P., AND YU, T. 1997. Autonomous document classification for business. In Proceedings of the 1st International Conference on Autonomous Agents (Marina del Rey, CA, 1997), 201–208.
CLEVERDON, C. 1984. Optimizing convenient online access to bibliographic databases. Inform. Serv. Use 4, 1, 37–47. Also reprinted in Willett [1988], pp. 32–41.
COHEN, W. W. 1995a. Learning to classify English text with ILP methods. In Advances in Inductive
Logic Programming, L. De Raedt, ed. IOS Press, Amsterdam, The Netherlands, 124–143.
COHEN, W. W. 1995b. Text categorization and relational learning. In Proceedings of ICML-95, 12th International Conference on Machine Learning (Lake Tahoe, CA, 1995), 124–132.
COHEN, W. W. AND HIRSH, H. 1998. Joins that generalize: text classification using WHIRL. In Proceedings of KDD-98, 4th International Conference on Knowledge Discovery and Data Mining (New York, NY, 1998), 169–173.
COHEN, W. W. AND SINGER, Y. 1999. Context-sensitive learning methods for text categorization. ACM Trans. Inform. Syst. 17, 2, 141–173.
COOPER, W. S. 1995. Some inconsistencies and misnomers in probabilistic information retrieval. ACM Trans. Inform. Syst. 13, 1, 100–111.
CREECY, R. M., MASAND, B. M., SMITH, S. J., AND WALTZ, D. L. 1992. Trading MIPS and memory for knowledge engineering: classifying census returns on the Connection Machine. Commun. ACM 35, 8, 48–63.
CRESTANI, F., LALMAS, M., VAN RIJSBERGEN, C. J., AND CAMPBELL, I. 1998. Is this document relevant? . . . probably. A survey of probabilistic models in information retrieval. ACM Comput. Surv. 30, 4, 528–552.
DAGAN, I., KAROV, Y., AND ROTH, D. 1997. Mistake-driven learning in text categorization. In Proceedings of EMNLP-97, 2nd Conference on Empirical Methods in Natural Language Processing (Providence, RI, 1997), 55–63.
DEERWESTER, S., DUMAIS, S. T., FURNAS, G. W., LANDAUER, T. K., AND HARSHMAN, R. 1990. Indexing by latent semantic analysis. J. Amer. Soc. Inform. Sci. 41, 6, 391–407.
DENOYER, L., ZARAGOZA, H., AND GALLINARI, P. 2001. HMM-based passage models for document classification and ranking. In Proceedings of ECIR-01, 23rd European Colloquium on Information Retrieval Research (Darmstadt, Germany, 2001).
DÍAZ ESTEBAN, A., DE BUENAGA RODRÍGUEZ, M., UREÑA LÓPEZ, L. A., AND GARCÍA VEGA, M. 1998. Integrating linguistic resources in an uniform way for text classification tasks. In Proceedings of LREC-98, 1st International Conference on Language Resources and Evaluation (Granada, Spain, 1998), 1197–1204.
DOMINGOS, P. AND PAZZANI, M. J. 1997. On the optimality of the simple Bayesian classifier under zero-one loss. Mach. Learn. 29, 2–3, 103–130.
DRUCKER, H., VAPNIK, V., AND WU, D. 1999. Support vector machines for spam categorization. IEEE Trans. Neural Netw. 10, 5, 1048–1054.
DUMAIS, S. T. AND CHEN, H. 2000. Hierarchical classification of Web content. In Proceedings of SIGIR-00, 23rd ACM International Conference on Research and Development in Information Retrieval (Athens, Greece, 2000), 256–263.
DUMAIS, S. T., PLATT, J., HECKERMAN, D., AND SAHAMI, M. 1998. Inductive learning algorithms and representations for text categorization. In Proceedings of CIKM-98, 7th ACM International Conference on Information and Knowledge Management (Bethesda, MD, 1998), 148–155.
ESCUDERO, G., MÀRQUEZ, L., AND RIGAU, G. 2000. Boosting applied to word sense disambiguation. In Proceedings of ECML-00, 11th European Conference on Machine Learning (Barcelona, Spain, 2000), 129–141.
FIELD, B. 1975. Towards automatic indexing: automatic assignment of controlled-language indexing and classification from free indexing. J. Document. 31, 4, 246–265.
FORSYTH, R. S. 1999. New directions in text categorization. In Causal Models and Intelligent Data Management, A. Gammerman, ed. Springer, Heidelberg, Germany, 151–185.
FRASCONI, P., SODA, G., AND VULLO, A. 2002. Text categorization for multi-page documents: A hybrid naive Bayes-HMM approach. J. Intell. Inform. Syst. 18, 2/3 (March–May), 195–217.
FUHR, N. 1985. A probabilistic model of dictionary-based automatic indexing. In Proceedings of RIAO-85, 1st International Conference Recherche d'Information Assistée par Ordinateur (Grenoble, France, 1985), 207–216.
FUHR, N. 1989. Models for retrieval with probabilistic indexing. Inform. Process. Man. 25, 1, 55–72.
FUHR, N. AND BUCKLEY, C. 1991. A probabilistic learning approach for document indexing. ACM Trans. Inform. Syst. 9, 3, 223–248.
FUHR, N., HARTMANN, S., KNORZ, G., LUSTIG, G., SCHWANTNER, M., AND TZERAS, K. 1991. AIR/X – a rule-based multistage indexing system for large subject fields. In Proceedings of RIAO-91, 3rd International Conference Recherche d'Information Assistée par Ordinateur (Barcelona, Spain, 1991), 606–623.
FUHR, N. AND KNORZ, G. 1984. Retrieval test evaluation of a rule-based automated indexing (AIR/PHYS). In Proceedings of SIGIR-84, 7th ACM International Conference on Research and Development in Information Retrieval (Cambridge, UK, 1984), 391–408.
FUHR, N. AND PFEIFER, U. 1994. Probabilistic information retrieval as combination of abstraction, inductive learning and probabilistic assumptions. ACM Trans. Inform. Syst. 12, 1, 92–115.
FÜRNKRANZ, J. 1999. Exploiting structural information for text classification on the WWW. In Proceedings of IDA-99, 3rd Symposium on Intelligent Data Analysis (Amsterdam, The Netherlands, 1999), 487–497.
GALAVOTTI, L., SEBASTIANI, F., AND SIMI, M. 2000. Experiments on the use of feature selection and negative evidence in automated text categorization. In Proceedings of ECDL-00, 4th European Conference on Research and
Advanced Technology for Digital Libraries (Lisbon, Portugal, 2000), 59–68.
GALE, W. A., CHURCH, K. W., AND YAROWSKY, D. 1993. A method for disambiguating word senses in a large corpus. Comput. Human. 26, 5, 415–439.
GÖVERT, N., LALMAS, M., AND FUHR, N. 1999. A probabilistic description-oriented approach for categorising Web documents. In Proceedings of CIKM-99, 8th ACM International Conference on Information and Knowledge Management (Kansas City, MO, 1999), 475–482.
GRAY, W. A. AND HARLEY, A. J. 1971. Computer-assisted indexing. Inform. Storage Retrieval 7, 4, 167–174.
GUTHRIE, L., WALKER, E., AND GUTHRIE, J. A. 1994. Document classification by machine: theory and practice. In Proceedings of COLING-94, 15th International Conference on Computational Linguistics (Kyoto, Japan, 1994), 1059–1063.
HAYES, P. J., ANDERSEN, P. M., NIRENBURG, I. B., AND SCHMANDT, L. M. 1990. TCS: a shell for content-based text categorization. In Proceedings of CAIA-90, 6th IEEE Conference on Artificial Intelligence Applications (Santa Barbara, CA, 1990), 320–326.
HEAPS, H. 1973. A theory of relevance for automatic document classification. Inform. Control 22, 3, 268–278.
HERSH, W., BUCKLEY, C., LEONE, T., AND HICKMAN, D. 1994. OHSUMED: an interactive retrieval evaluation and new large text collection for research. In Proceedings of SIGIR-94, 17th ACM International Conference on Research and Development in Information Retrieval (Dublin, Ireland, 1994), 192–201.
HULL, D. A. 1994. Improving text retrieval for the routing problem using latent semantic indexing. In Proceedings of SIGIR-94, 17th ACM International Conference on Research and Development in Information Retrieval (Dublin, Ireland, 1994), 282–289.
HULL, D. A., PEDERSEN, J. O., AND SCHÜTZE, H. 1996. Method combination for document filtering. In Proceedings of SIGIR-96, 19th ACM International Conference on Research and Development in Information Retrieval (Zürich, Switzerland, 1996), 279–288.
ITTNER, D. J., LEWIS, D. D., AND AHN, D. D. 1995. Text categorization of low quality images. In Proceedings of SDAIR-95, 4th Annual Symposium on Document Analysis and Information Retrieval (Las Vegas, NV, 1995), 301–315.
IWAYAMA, M. AND TOKUNAGA, T. 1995. Cluster-based text categorization: a comparison of category search strategies. In Proceedings of SIGIR-95, 18th ACM International Conference on Research and Development in Information Retrieval (Seattle, WA, 1995), 273–281.
IYER, R. D., LEWIS, D. D., SCHAPIRE, R. E., SINGER, Y., AND SINGHAL, A. 2000. Boosting for document routing. In Proceedings of CIKM-00, 9th ACM International Conference on Information and Knowledge Management (McLean, VA, 2000), 70–77.
JOACHIMS, T. 1997. A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization. In Proceedings of ICML-97, 14th International Conference on Machine Learning (Nashville, TN, 1997), 143–151.
JOACHIMS, T. 1998. Text categorization with support vector machines: learning with many relevant features. In Proceedings of ECML-98, 10th European Conference on Machine Learning (Chemnitz, Germany, 1998), 137–142.
JOACHIMS, T. 1999. Transductive inference for text classification using support vector machines. In Proceedings of ICML-99, 16th International Conference on Machine Learning (Bled, Slovenia, 1999), 200–209.
JOACHIMS, T. AND SEBASTIANI, F. 2002. Guest editors' introduction to the special issue on automated text categorization. J. Intell. Inform. Syst. 18, 2/3 (March–May), 103–105.
JOHN, G. H., KOHAVI, R., AND PFLEGER, K. 1994. Irrelevant features and the subset selection problem. In Proceedings of ICML-94, 11th International Conference on Machine Learning (New Brunswick, NJ, 1994), 121–129.
JUNKER, M. AND ABECKER, A. 1997. Exploiting thesaurus knowledge in rule induction for text classification. In Proceedings of RANLP-97, 2nd International Conference on Recent Advances in Natural Language Processing (Tzigov Chark, Bulgaria, 1997), 202–207.
JUNKER, M. AND HOCH, R. 1998. An experimental evaluation of OCR text representations for learning document classifiers. Internat. J. Document Analysis and Recognition 1, 2, 116–122.
KESSLER, B., NUNBERG, G., AND SCHÜTZE, H. 1997. Automatic detection of text genre. In Proceedings of ACL-97, 35th Annual Meeting of the Association for Computational Linguistics (Madrid, Spain, 1997), 32–38.
KIM, Y.-H., HAHN, S.-Y., AND ZHANG, B.-T. 2000. Text filtering by boosting naive Bayes classifiers. In Proceedings of SIGIR-00, 23rd ACM International Conference on Research and Development in Information Retrieval (Athens, Greece, 2000), 168–175.
KLINKENBERG, R. AND JOACHIMS, T. 2000. Detecting concept drift with support vector machines. In Proceedings of ICML-00, 17th International Conference on Machine Learning (Stanford, CA, 2000), 487–494.
KNIGHT, K. 1999. Mining online text. Commun. ACM 42, 11, 58–61.
KNORZ, G. 1982. A decision theory approach to optimal automated indexing. In Proceedings of SIGIR-82, 5th ACM International Conference on Research and Development in Information Retrieval (Berlin, Germany, 1982), 174–193.
KOLLER, D. AND SAHAMI, M. 1997. Hierarchically classifying documents using very few words. In
Proceedings of ICML-97, 14th International Conference on Machine Learning (Nashville, TN, 1997), 170–178.
KORFHAGE, R. R. 1997. Information Storage and Retrieval. Wiley Computer Publishing, New York, NY.
LAM, S. L. AND LEE, D. L. 1999. Feature reduction for neural network based text categorization. In Proceedings of DASFAA-99, 6th IEEE International Conference on Database Advanced Systems for Advanced Application (Hsinchu, Taiwan, 1999), 195–202.
LAM, W. AND HO, C. Y. 1998. Using a generalized instance set for automatic text categorization. In Proceedings of SIGIR-98, 21st ACM International Conference on Research and Development in Information Retrieval (Melbourne, Australia, 1998), 81–89.
LAM, W., LOW, K. F., AND HO, C. Y. 1997. Using a Bayesian network induction approach for text categorization. In Proceedings of IJCAI-97, 15th International Joint Conference on Artificial Intelligence (Nagoya, Japan, 1997), 745–750.
LAM, W., RUIZ, M. E., AND SRINIVASAN, P. 1999. Automatic text categorization and its applications to text retrieval. IEEE Trans. Knowl. Data Engin. 11, 6, 865–879.
LANG, K. 1995. NEWSWEEDER: learning to filter netnews. In Proceedings of ICML-95, 12th International Conference on Machine Learning (Lake Tahoe, CA, 1995), 331–339.
LARKEY, L. S. 1998. Automatic essay grading using text categorization techniques. In Proceedings of SIGIR-98, 21st ACM International Conference on Research and Development in Information Retrieval (Melbourne, Australia, 1998), 90–95.
LARKEY, L. S. 1999. A patent search and classification system. In Proceedings of DL-99, 4th ACM Conference on Digital Libraries (Berkeley, CA, 1999), 179–187.
LARKEY, L. S. AND CROFT, W. B. 1996. Combining classifiers in text categorization. In Proceedings of SIGIR-96, 19th ACM International Conference on Research and Development in Information Retrieval (Zürich, Switzerland, 1996), 289–297.
LEWIS, D. D. 1992a. An evaluation of phrasal and clustered representations on a text categorization task. In Proceedings of SIGIR-92, 15th ACM International Conference on Research and Development in Information Retrieval (Copenhagen, Denmark, 1992), 37–50.
LEWIS, D. D. 1992b. Representation and Learning in Information Retrieval. Ph.D. thesis, Department of Computer Science, University of Massachusetts, Amherst, MA.
LEWIS, D. D. 1995a. Evaluating and optimizing autonomous text classification systems. In Proceedings of SIGIR-95, 18th ACM International Conference on Research and Development in Information Retrieval (Seattle, WA, 1995), 246–254.
LEWIS, D. D. 1995b. A sequential algorithm for training text classifiers: corrigendum and additional data. SIGIR Forum 29, 2, 13–19.
LEWIS, D. D. 1995c. The TREC-4 filtering track: description and analysis. In Proceedings of TREC-4, 4th Text Retrieval Conference (Gaithersburg, MD, 1995), 165–180.
LEWIS, D. D. 1998. Naive (Bayes) at forty: The independence assumption in information retrieval. In Proceedings of ECML-98, 10th European Conference on Machine Learning (Chemnitz, Germany, 1998), 4–15.
LEWIS, D. D. AND CATLETT, J. 1994. Heterogeneous uncertainty sampling for supervised learning. In Proceedings of ICML-94, 11th International Conference on Machine Learning (New Brunswick, NJ, 1994), 148–156.
LEWIS, D. D. AND GALE, W. A. 1994. A sequential algorithm for training text classifiers. In Proceedings of SIGIR-94, 17th ACM International Conference on Research and Development in Information Retrieval (Dublin, Ireland, 1994), 3–12. See also Lewis [1995b].
LEWIS, D. D. AND HAYES, P. J. 1994. Guest editorial for the special issue on text categorization. ACM Trans. Inform. Syst. 12, 3, 231.
LEWIS, D. D. AND RINGUETTE, M. 1994. A comparison of two learning algorithms for text categorization. In Proceedings of SDAIR-94, 3rd Annual Symposium on Document Analysis and Information Retrieval (Las Vegas, NV, 1994), 81–93.
LEWIS, D. D., SCHAPIRE, R. E., CALLAN, J. P., AND PAPKA, R. 1996. Training algorithms for linear text classifiers. In Proceedings of SIGIR-96, 19th ACM International Conference on Research and Development in Information Retrieval (Zürich, Switzerland, 1996), 298–306.
LI, H. AND YAMANISHI, K. 1999. Text classification using ESC-based stochastic decision lists. In Proceedings of CIKM-99, 8th ACM International Conference on Information and Knowledge Management (Kansas City, MO, 1999), 122–130.
LI, Y. H. AND JAIN, A. K. 1998. Classification of text documents. Comput. J. 41, 8, 537–546.
LIDDY, E. D., PAIK, W., AND YU, E. S. 1994. Text categorization for multiple users based on semantic features from a machine-readable dictionary. ACM Trans. Inform. Syst. 12, 3, 278–295.
LIERE, R. AND TADEPALLI, P. 1997. Active learning with committees for text categorization. In Proceedings of AAAI-97, 14th Conference of the American Association for Artificial Intelligence (Providence, RI, 1997), 591–596.
LIM, J. H. 1999. Learnable visual keywords for image classification. In Proceedings of DL-99, 4th ACM Conference on Digital Libraries (Berkeley, CA, 1999), 139–145.
MANNING, C. AND SCHÜTZE, H. 1999. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA.
MARON, M. 1961. Automatic indexing: an experimental inquiry. J. Assoc. Comput. Mach. 8, 3, 404–417.
MASAND, B. 1994. Optimising confidence of text classification by evolution of symbolic expressions. In Advances in Genetic Programming, K. E. Kinnear, ed. MIT Press, Cambridge, MA, Chapter 21, 459–476.
MASAND, B., LINOFF, G., AND WALTZ, D. 1992. Classifying news stories using memory-based reasoning. In Proceedings of SIGIR-92, 15th ACM International Conference on Research and Development in Information Retrieval (Copenhagen, Denmark, 1992), 59–65.
MCCALLUM, A. K. AND NIGAM, K. 1998. Employing EM in pool-based active learning for text classification. In Proceedings of ICML-98, 15th International Conference on Machine Learning (Madison, WI, 1998), 350–358.
MCCALLUM, A. K., ROSENFELD, R., MITCHELL, T. M., AND NG, A. Y. 1998. Improving text classification by shrinkage in a hierarchy of classes. In Proceedings of ICML-98, 15th International Conference on Machine Learning (Madison, WI, 1998), 359–367.
MERKL, D. 1998. Text classification with self-organizing maps: Some lessons learned. Neurocomputing 21, 1/3, 61–77.
MITCHELL, T. M. 1996. Machine Learning. McGraw Hill, New York, NY.
MLADENIĆ, D. 1998. Feature subset selection in text learning. In Proceedings of ECML-98, 10th European Conference on Machine Learning (Chemnitz, Germany, 1998), 95–100.
MLADENIĆ, D. AND GROBELNIK, M. 1998. Word sequences as features in text-learning. In Proceedings of ERK-98, the Seventh Electrotechnical and Computer Science Conference (Ljubljana, Slovenia, 1998), 145–148.
MOULINIER, I. AND GANASCIA, J.-G. 1996. Applying an existing machine learning algorithm to text categorization. In Connectionist, Statistical, and Symbolic Approaches to Learning for Natural Language Processing, S. Wermter, E. Riloff, and G. Scheler, eds. Springer Verlag, Heidelberg, Germany, 343–354.
MOULINIER, I., RAŠKINIS, G., AND GANASCIA, J.-G. 1996. Text categorization: a symbolic approach. In Proceedings of SDAIR-96, 5th Annual Symposium on Document Analysis and Information Retrieval (Las Vegas, NV, 1996), 87–99.
MYERS, K., KEARNS, M., SINGH, S., AND WALKER, M. A. 2000. A boosting approach to topic spotting on subdialogues. In Proceedings of ICML-00, 17th International Conference on Machine Learning (Stanford, CA, 2000), 655–662.
NG, H. T., GOH, W. B., AND LOW, K. L. 1997. Feature selection, perceptron learning, and a usability case study for text categorization. In Proceedings of SIGIR-97, 20th ACM International Conference on Research and Development in Information Retrieval (Philadelphia, PA, 1997), 67–73.
NIGAM, K., MCCALLUM, A. K., THRUN, S., AND MITCHELL, T. M. 2000. Text classification from labeled and unlabeled documents using EM. Mach. Learn. 39, 2/3, 103–134.
OH, H.-J., MYAENG, S. H., AND LEE, M.-H. 2000. A practical hypertext categorization method using links and incrementally available class information. In Proceedings of SIGIR-00, 23rd ACM International Conference on Research and Development in Information Retrieval (Athens, Greece, 2000), 264–271.
PAZIENZA, M. T., ed. 1997. Information Extraction. Lecture Notes in Computer Science, Vol. 1299. Springer, Heidelberg, Germany.
RILOFF, E. 1995. Little words can make a big difference for text classification. In Proceedings of SIGIR-95, 18th ACM International Conference on Research and Development in Information Retrieval (Seattle, WA, 1995), 130–136.
RILOFF, E. AND LEHNERT, W. 1994. Information extraction as a basis for high-precision text classification. ACM Trans. Inform. Syst. 12, 3, 296–333.
ROBERTSON, S. E. AND HARDING, P. 1984. Probabilistic automatic indexing by learning from human indexers. J. Document. 40, 4, 264–270.
ROBERTSON, S. E. AND SPARCK JONES, K. 1976. Relevance weighting of search terms. J. Amer. Soc. Inform. Sci. 27, 3, 129–146. Also reprinted in Willett [1988], pp. 143–160.
ROTH, D. 1998. Learning to resolve natural language ambiguities: a unified approach. In Proceedings of AAAI-98, 15th Conference of the American Association for Artificial Intelligence (Madison, WI, 1998), 806–813.
RUIZ, M. E. AND SRINIVASAN, P. 1999. Hierarchical neural networks for text categorization. In Proceedings of SIGIR-99, 22nd ACM International Conference on Research and Development in Information Retrieval (Berkeley, CA, 1999), 281–282.
SABLE, C. L. AND HATZIVASSILOGLOU, V. 2000. Text-based approaches for non-topical image categorization. Internat. J. Dig. Libr. 3, 3, 261–275.
SALTON, G. AND BUCKLEY, C. 1988. Term-weighting approaches in automatic text retrieval. Inform. Process. Man. 24, 5, 513–523. Also reprinted in Sparck Jones and Willett [1997], pp. 323–328.
SALTON, G., WONG, A., AND YANG, C. 1975. A vector space model for automatic indexing. Commun. ACM 18, 11, 613–620. Also reprinted in Sparck Jones and Willett [1997], pp. 273–280.
SARACEVIC, T. 1975. Relevance: a review of and a framework for the thinking on the notion in information science. J. Amer. Soc. Inform. Sci. 26, 6, 321–343. Also reprinted in Sparck Jones and Willett [1997], pp. 143–165.
SCHAPIRE, R. E. AND SINGER, Y. 2000. BoosTexter: a boosting-based system for text categorization. Mach. Learn. 39, 2/3, 135–168.
SCHAPIRE, R. E., SINGER, Y., AND SINGHAL, A. 1998. Boosting and Rocchio applied to text filtering. In Proceedings of SIGIR-98, 21st ACM International Conference on Research and Development in Information Retrieval (Melbourne, Australia, 1998), 215–223.
SCHÜTZE, H. 1998. Automatic word sense discrimination. Computat. Ling. 24, 1, 97–124.
SCHÜTZE, H., HULL, D. A., AND PEDERSEN, J. O. 1995. A comparison of classifiers and document representations for the routing problem. In Proceedings of SIGIR-95, 18th ACM International Conference on Research and Development in Information Retrieval (Seattle, WA, 1995), 229–237.
SCOTT, S. AND MATWIN, S. 1999. Feature engineering for text classification. In Proceedings of ICML-99, 16th International Conference on Machine Learning (Bled, Slovenia, 1999), 379–388.
SEBASTIANI, F., SPERDUTI, A., AND VALDAMBRINI, N. 2000. An improved boosting algorithm and its application to automated text categorization. In Proceedings of CIKM-00, 9th ACM International Conference on Information and Knowledge Management (McLean, VA, 2000), 78–85.
SINGHAL, A., MITRA, M., AND BUCKLEY, C. 1997. Learning routing queries in a query zone. In Proceedings of SIGIR-97, 20th ACM International Conference on Research and Development in Information Retrieval (Philadelphia, PA, 1997), 25–32.
SINGHAL, A., SALTON, G., MITRA, M., AND BUCKLEY, C. 1996. Document length normalization. Inform. Process. Man. 32, 5, 619–633.
SLONIM, N. AND TISHBY, N. 2001. The power of word clusters for text classification. In Proceedings of ECIR-01, 23rd European Colloquium on Information Retrieval Research (Darmstadt, Germany, 2001).
SPARCK JONES, K. AND WILLETT, P., eds. 1997. Readings in Information Retrieval. Morgan Kaufmann, San Mateo, CA.
TAIRA, H. AND HARUNO, M. 1999. Feature selection in SVM text categorization. In Proceedings of AAAI-99, 16th Conference of the American Association for Artificial Intelligence (Orlando, FL, 1999), 480–486.
TAURITZ, D. R., KOK, J. N., AND SPRINKHUIZEN-KUYPER, I. G. 2000. Adaptive information filtering using evolutionary computation. Inform. Sci. 122, 2–4, 121–140.
TUMER, K. AND GHOSH, J. 1996. Error correlation and error reduction in ensemble classifiers. Connection Sci. 8, 3–4, 385–403.
TZERAS, K. AND HARTMANN, S. 1993. Automatic indexing based on Bayesian inference networks. In Proceedings of SIGIR-93, 16th ACM International Conference on Research and Development in Information Retrieval (Pittsburgh, PA, 1993), 22–34.
VAN RIJSBERGEN, C. J. 1977. A theoretical basis for the use of co-occurrence data in information retrieval. J. Document. 33, 2, 106–119.
VAN RIJSBERGEN, C. J. 1979. Information Retrieval, 2nd ed. Butterworths, London, UK. Available at http://www.dcs.gla.ac.uk/Keith.
WEIGEND, A. S., WIENER, E. D., AND PEDERSEN, J. O. 1999. Exploiting hierarchy in text categorization. Inform. Retr. 1, 3, 193–216.
WEISS, S. M., APTÉ, C., DAMERAU, F. J., JOHNSON, D. E., OLES, F. J., GOETZ, T., AND HAMPP, T. 1999. Maximizing text-mining performance. IEEE Intell. Syst. 14, 4, 63–69.
WIENER, E. D., PEDERSEN, J. O., AND WEIGEND, A. S. 1995. A neural network approach to topic spotting. In Proceedings of SDAIR-95, 4th Annual Symposium on Document Analysis and Information Retrieval (Las Vegas, NV, 1995), 317–332.
WILLETT, P., ed. 1988. Document Retrieval Systems. Taylor Graham, London, UK.
WONG, J. W., KAN, W.-K., AND YOUNG, G. H. 1996. ACTION: automatic classification for full-text documents. SIGIR Forum 30, 1, 26–41.
YANG, Y. 1994. Expert network: effective and efficient learning from human decisions in text categorisation and retrieval. In Proceedings of SIGIR-94, 17th ACM International Conference on Research and Development in Information Retrieval (Dublin, Ireland, 1994), 13–22.
YANG, Y. 1995. Noise reduction in a statistical approach to text categorization. In Proceedings of SIGIR-95, 18th ACM International Conference on Research and Development in Information Retrieval (Seattle, WA, 1995), 256–263.
YANG, Y. 1999. An evaluation of statistical approaches to text categorization. Inform. Retr. 1, 1–2, 69–90.
YANG, Y. AND CHUTE, C. G. 1994. An example-based mapping method for text categorization and retrieval. ACM Trans. Inform. Syst. 12, 3, 252–277.
YANG, Y. AND LIU, X. 1999. A re-examination of text categorization methods. In Proceedings of SIGIR-99, 22nd ACM International Conference on Research and Development in Information Retrieval (Berkeley, CA, 1999), 42–49.
YANG, Y. AND PEDERSEN, J. O. 1997. A comparative study on feature selection in text categorization. In Proceedings of ICML-97, 14th International Conference on Machine Learning (Nashville, TN, 1997), 412–420.
YANG, Y., SLATTERY, S., AND GHANI, R. 2002. A study of approaches to hypertext categorization. J. Intell. Inform. Syst. 18, 2/3 (March–May), 219–241.
YU, K. L. AND LAM, W. 1998. A new on-line learning algorithm for adaptive text filtering. In Proceedings of CIKM-98, 7th ACM International Conference on Information and Knowledge Management (Bethesda, MD, 1998), 156–160.

Received December 1999; revised February 2001; accepted July 2001
