
UNIVERSITY OF ZAGREB

FACULTY OF ELECTRICAL ENGINEERING AND COMPUTING

GRADUATE THESIS No. 1599

Comparison of different dimensionality


reduction methods for information retrieval
and text mining

Goran Jovanov

Zagreb, July, 2006.


Sincere thanks to my mentor, prof. dr.
sc. Bojana Dalbelo Bašić, for her
professional guidance and advice
throughout my graduate thesis and for
her contribution to my personal and
professional development.

I also want to thank mr. sc.
Jasminka Dobša for her assistance
with the implementation of the fuzzy
k-means clustering algorithm.

CONTENTS
1. INTRODUCTION ....................................................................................................... 3

2. INFORMATION RETRIEVAL ................................................................................ 5

2.1. VECTOR SPACE MODEL ................................................................................... 7


2.2. VECTOR SPACE MODEL EXAMPLES ........................................................... 10
2.3. EVALUATION .................................................................................................... 16
2.4. EXAMPLE OF DOCUMENTS AND QUERIES ............................................... 19

3. CLUSTERING .......................................................................................................... 21

4. DIMENSIONALITY REDUCTION ....................................................................... 24

4.1. LATENT SEMANTIC INDEXING .................................................................... 26


4.2. CONCEPT INDEXING....................................................................................... 29

4.2.1. SPHERICAL K-MEANS ............................................................................................ 31


4.2.2. FUZZY K-MEANS..................................................................................................... 35

4.3. LSI AND CI COMPARISON EXAMPLE .......................................................... 37


4.4. LSI FOLDING-IN ALGORITHM ...................................................................... 42

5. SYSTEM IMPLEMENTATION ............................................................................. 44

6. EXPERIMENTS AND RESULTS .......................................................................... 56

6.1. DECOMPOSITION ERROR .............................................................................. 57


6.2. ORTHONORMALITY OF CONCEPT VECTORS ............................................ 59
6.3. CI PERFORMANCES ........................................................................................ 61
6.4. INFORMATION RETRIEVAL EVALUATION................................................ 67

6.4.1. WITHOUT FOLDING-IN DOCUMENTS ................................................................ 68


6.4.2. WITH FOLDING-IN DOCUMENTS ........................................................................ 78

7. RELATED WORKS ................................................................................................. 82

8. CONCLUSIONS ....................................................................................................... 84

9. REFERENCES .......................................................................................................... 85

TABLES INDEX .................................................................................................................... 89

FIGURES INDEX .................................................................................................................. 90

1. INTRODUCTION

Large collections of documents are becoming increasingly common. The public


Internet currently has more than 1.5 billion web pages, while private intranets also
contain an abundance of text data. A vast amount of important scientific data
appears as technical abstracts and papers. Given such large document collections it
is important to organize them into structured ontologies. This organization facilitates
navigation and search, and at the same time provides a framework for continual
maintenance as document repositories grow in size.

Manual construction of structured ontologies is one possible solution and has


been adopted to organize the Internet (www.yahoo.com) and to structure library
content. However, this process has the obvious disadvantage of being labor
intensive, and is viable only for large organizations. Thus it is desirable to seek
automatic methods for organizing unlabeled document collections. Given a collection
of unlabeled data points, clustering refers to the problem of automatically assigning
class labels to the data; it has been widely studied in statistical pattern recognition
and machine learning. This has increased interest in methods that allow users to
quickly and accurately retrieve and organize such information, and disciplines like
text mining and information retrieval face many related challenges.

Text mining is a constituent discipline of data mining which operates on


unstructured text documents in a content-based manner, extracting useful
information ([17], [18]). Content-based operation means working solely with the
document content, without using metadata. Text mining comprises many different
methods, such as document clustering, categorization and automatic document
indexing (a subform of categorization).

The discipline of information retrieval deals with the representation, storage and


organization of, and access to, information (see [1], [2], [8], [21], [24]). Furthermore,
information retrieval interprets text document collections, abstracting the syntactic
and semantic information contained in them. The principal objective of an information
retrieval system is, for a given query, to retrieve all of the relevant documents and as
few irrelevant documents as possible. Unfortunately, due to problems such as
polysemy (words with multiple meanings) and synonymy (different words that have
the same meaning), the list of documents retrieved for a given query is almost never
perfect, and the user has to ignore some of the items.

Although there are many other models (see the Information retrieval chapter), the
algorithms we deal with are embedded in the vector space model. Documents are
represented as vectors of term frequencies, and the document set is represented as
a matrix of document vectors. One of the main problems of the vector space model
in information retrieval is the high dimensionality of the document-term matrix. The
number of documents in a collection may vary from a few thousand to several
hundred thousand, and the number of terms is often more than a few thousand.
Hence, dimensionality reduction appears to be very useful.

There are many methods for dimensionality reduction, but the most widely used
are Latent Semantic Indexing (LSI) (see [1], [2], [3], [15], [20], [25]), which is based
on Singular Value Decomposition (SVD), and Concept Indexing (CI) (see [4], [5], [6],
[7], [9]), which is based on Concept Decomposition (CD) using k-means clustering
algorithms (see the Clustering chapter). The comparison of these two methods ([8])
is the main objective of this thesis.

2. INFORMATION RETRIEVAL

The machine learning approach ([18], [22]) to classifier construction heavily relies on
the basic machinery of information retrieval. The reason is that both information
retrieval and document categorization are content-based document management
tasks, and therefore share many characteristics. Information retrieval techniques are
used in three phases of the classification task:

1. IR-style indexing is always (uniformly, i.e. by means of the same technique)


performed on the documents of the initial corpus and on those to be
categorized during the operating phase of the classifier;

2. IR-style techniques (such as document-request matching, query expansion,...)


are typically used in the inductive construction of the classifiers;

3. IR-style evaluation of the effectiveness of the classifiers is performed.

Document preprocessing is necessary before creating any information retrieval


model. The usual preprocessing steps are the following:

(1) lexical text processing (eliminating punctuation marks and numbers, ignoring
case)
(2) eliminating non-content-bearing words such as conjunctions, prepositions and
any similar words which generally have low semantic value in text exploration
(so called stop words)
(3) reducing words to their basic form such as stemming or lemmatization
(4) index term selection, e.g. preferring nouns and eliminating other forms of
those words
(5) construction and use of a thesaurus, a glossary of associated term sets
(e.g. synonyms)
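Steps (1)-(3) above can be sketched in a few lines of Python; the stop-word list and the suffix stripper below are deliberately tiny illustrations (a real system would use a full stop-word list and a proper stemmer such as Porter's):

```python
import re

# Tiny illustrative stop-word list; real systems use much larger ones.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "for", "in", "on", "by"}

def crude_stem(word):
    # Deliberately crude suffix stripping, for illustration only.
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    # (1) lexical processing: lowercase, drop punctuation marks and numbers
    tokens = re.findall(r"[a-z]+", text.lower())
    # (2) eliminate non-content-bearing stop words
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # (3) reduce words to a basic form
    return [crude_stem(t) for t in tokens]

print(preprocess("Clustering of large data sets."))
# → ['cluster', 'large', 'data', 'set']
```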

In the following section, different models of information retrieval are presented and
discussed. An information retrieval model can be precisely defined in the following
way:

Definition 2.1 [30] An information retrieval model is an ordered quadruple (Dr, Qr, F,
g(q, a)) where:
- Dr is a representation of the document set,
- Qr is a representation of the query set,
- F is a set of rules for modeling the representations of documents and queries
and the relationships between them,
- g(q, a) : Qr × Dr → R is a real-valued function which defines the ordering of
documents by relevance to a given query, also called the decision function.

The three classical models of the information retrieval discipline are the
probabilistic, the logic and the vector space model. In the probabilistic model, the set
of rules for modeling document and query representations is based on probability
theory. In the logic model, documents and queries are represented by sets of index
terms, so this model is based on set theory. Finally, in the vector space model,
documents and queries are represented as multidimensional vectors, so this model
is based on linear algebra. The algorithms used in this thesis are embedded in the
vector space model.

2.1. VECTOR SPACE MODEL

Today, the vector space model is the most popular model in the information
retrieval discipline. Unlike in other models, index terms in documents and queries
are assigned weights (real values), and the similarity measure takes values from the
interval [0, 1].

Document representation in the vector space model is often called the bag-of-words


representation. The name alludes to the presumption of index term
independence and to the loss of information about the relationships between terms.

Let T = {t1, t2, ..., tm} denote the set of index terms and D = {d1, d2, ..., dn} the set of
documents in the collection. Furthermore, let aij denote the weight assigned to the
pair (ti, dj), for i = 1, 2, ..., m and j = 1, 2, ..., n. The weight values are real and
positive. Index terms in the representation of a query q are also assigned weights:
let qi, for i = 1, 2, ..., m, denote the weight assigned to the pair (ti, q).

Definition 2.2 [31] In the vector space model, documents dj, j = 1, 2, ..., n are
represented by vectors of the form aj = (a1j, a2j, ..., amj)T, and queries by q = (q1, q2,
..., qm)T. The document collection D is represented by the document-term matrix
A = [aij] = [a1 a2 ... an] (see Figure 2.1).

Each column of the document-term matrix represents one document from the


collection, and each row represents one index term (the weights of that term across
the document collection).

            | a11 ... a1n |    <- t1
            |     ...     |
A = [aij] = |     aij     |    <- ti
            |     ...     |
            | am1 ... amn |    <- tm
              a1 ... aj ... an

Figure 2.1 Document-term matrix

Definition 2.3 [31] The similarity measure between document dj and query q is
defined as the cosine of the angle between their vector representations:

sim(dj, q) = cos(∠(aj, q)) = ajT q / (||aj|| ||q||)
           = Σ(i=1..m) aij qi / ( sqrt(Σ(i=1..m) aij^2) · sqrt(Σ(i=1..m) qi^2) )   (2.1)

where ||aj|| and ||q|| are the Euclidean norms of the vector representations of the
document and the query.

Since the values aij and qi are positive, the similarity measure takes values from
the interval [0, 1]. Values close to 1 indicate a better match between document dj
and query q. In practice, a similarity threshold t is often defined, and the documents
retrieved for a query q are those whose similarity measure falls in the interval [t, 1].
Another approach to filtering and ranking retrieved documents is to sort the
documents by decreasing similarity and to retrieve only the k documents with the
highest similarity values.
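The similarity computation of formula 2.1 can be sketched directly; the four-component vectors below are illustrative:

```python
import math

def cosine_similarity(a, q):
    # sim(d_j, q) = a_j^T q / (||a_j|| ||q||), formula (2.1)
    dot = sum(x * y for x, y in zip(a, q))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_q = math.sqrt(sum(y * y for y in q))
    return dot / (norm_a * norm_q)

# A document vector and a query vector over the same four index terms
a_j = [2, 1, 0, 1]
q = [1, 1, 0, 0]
print(round(cosine_similarity(a_j, q), 4))
# → 0.866
```

Ranking then amounts to sorting the documents by this value and keeping the top k, or those above a threshold t.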

The above preprocessing scheme yields fji, the number of occurrences of word j
in document i, and dj, the number of documents which contain word j. Using these
counts, we now create n document vectors in Rd, namely a1, a2, ..., an, as follows.
For 1 ≤ j ≤ d, the j-th component of document vector ai, 1 ≤ i ≤ n, is set to the
product of three terms

aji = tji · gj · si   (2.2)

where tji is the term weighting component and depends only on fji, gj is the global
weighting component and depends on dj, and si is the normalization component for
ai. Intuitively, tji captures the relative importance of a word in a document, while gj
captures the overall importance of a word in the entire set of documents. The
objective of such weighting schemes is to enhance discrimination between various
document vectors and to enhance retrieval effectiveness.

There are many schemes for selecting the term, global, and normalization
components; for example, ([4], [7], [17], [18]) present 5, 5, and 2 schemes,
respectively, for the term, global, and normalization components, a total of
5 × 5 × 2 = 50 choices. From this extensive set we will use two popular schemes,
denoted txn and tfn and known, respectively, as normalized term frequency and
normalized term frequency-inverse document frequency. Both schemes
emphasize words with higher frequencies and use tji = fji. The txn scheme uses
gj = 1, while the tfn scheme emphasizes words with low overall collection frequency
and uses formula 2.3 (dj is the number of documents in which the index term occurs;
n is the number of documents in the collection). In both schemes, each document
vector is normalized to have unit L2 norm, that is,

gj = log(n / dj)   (2.3)        si = ( Σ(j=1..d) (tji gj)^2 )^(-1/2)   (2.4)

Intuitively, the effect of normalization is to retain only the direction of the


document vectors. This ensures that documents dealing with the same subject matter
(that is, using similar words), but differing in length lead to similar document vectors.
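The tfn scheme described above can be sketched as follows; row i of the illustrative matrix f holds the raw frequencies fji for document i:

```python
import math

def tfn_vectors(f):
    # f[i][j]: number of occurrences of term j in document i
    n = len(f)      # number of documents
    m = len(f[0])   # number of terms
    # d_j: number of documents containing term j
    d = [sum(1 for i in range(n) if f[i][j] > 0) for j in range(m)]
    # global weight g_j = log(n / d_j), formula (2.3)
    g = [math.log(n / d[j]) if d[j] else 0.0 for j in range(m)]
    vectors = []
    for row in f:
        weighted = [t * gj for t, gj in zip(row, g)]    # t_ji * g_j
        norm = math.sqrt(sum(w * w for w in weighted))  # 1 / s_i, formula (2.4)
        vectors.append([w / norm if norm else 0.0 for w in weighted])
    return vectors

docs = [[2, 1, 0], [0, 1, 1], [1, 0, 2]]
vecs = tfn_vectors(docs)
# Each resulting document vector has unit L2 norm:
print(all(abs(sum(w * w for w in v) - 1.0) < 1e-9 for v in vecs))
# → True
```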

2.2. VECTOR SPACE MODEL EXAMPLES

Example 2.1 [8] The document collection in this example is composed of 15 book or
article titles divided into two clusters. The first cluster is composed of 9 data mining
(DM) documents, the second cluster contains 5 documents related to linear algebra
(LA), and document D6 (Matrices, vector spaces, and information retrieval) is a
combination of both disciplines (data mining and linear algebra). The index term list
is formed in three steps:

1. only terms that occur in at least two documents are considered


2. "stop words" are eliminated (conjunctions, definite articles and similar words
with no semantic value for information retrieval)
3. word variants are mapped to their basic form; e.g. the word matrices is
mapped to the index term matrix, and the words applications and applied are
mapped to the index term application.

As a result we get the following index term list: 1) text, 2) mining, 3) clustering, 4)
classification, 5) retrieval, 6) analysis, 7) information, 8) linear, 9) algebra, 10) matrix,
11) application, 12) document, 13) vector, 14) space, 15) data and 16) algorithm.
The documents and their categorization are shown in Table 2.1. Two queries are
presented in order to illustrate the information retrieval process:

 Q1: Data mining


 Q2: Using linear algebra for data mining

The relevant documents for query Q1 are the DM documents and the documents
categorized in both categories, whereas only document D6 is relevant for query Q2.
Most of the documents relevant for Q1 do not contain the index terms of Q1 (but they
do contain index terms such as clustering, classification, information and retrieval,
which are relevant to the DM discipline). D6, which is relevant for Q2, does not
contain any index term from Q2.

Label   Category   Document

D1      DM         Survey of text mining: clustering, classification, and retrieval
D2      DM         Automatic text processing: the transformation analysis and
                   retrieval of information by computer
D3      LA         Elementary linear algebra: A matrix approach
D4      LA         Matrix algebra and its applications in statistics and econometrics
D5      DM         Effective databases for text and document management
D6      Both       Matrices, vector spaces, and information retrieval
D7      LA         Matrix analysis and applied linear algebra
D8      LA         Topological vector spaces and algebras
D9      DM         Information retrieval: data structures and algorithms
D10     LA         Vector spaces and algebras for chemistry and physics
D11     DM         Classification, clustering and data analysis
D12     DM         Clustering of large data sets
D13     DM         Clustering algorithms
D14     DM         Document warehousing and text mining: techniques for
                   improving business operations, marketing and sales
D15     DM         Data mining and knowledge discovery

Table 2.1 Documents and their categorization (example 2.1)

First, the document-term matrix F is formed, where the matrix component at (i, j)


is the number of occurrences of the i-th index term in the j-th document. Query
vectors q1 and q2 are formed analogously:

q1 = (0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0)T,
q2 = (0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0)T.

Before computing the similarities between queries Q1 and Q2 and the documents
Di (1 ≤ i ≤ 15) from the collection, the document and query vectors are transformed
into txn form. Because the document and query vectors are now unit vectors, the
inner product between vectors is used as the similarity measure:

cos(∠(a, b)) = a · b / (||a|| ||b||) = aT b   (2.5)

The document rankings by relevance for queries Q1 and Q2 are shown in Table


2.2. For Q1 we see that 6 of the 10 relevant documents and no irrelevant documents
are retrieved, whereas for Q2 the only relevant document, D6, is not retrieved.

Query Q1 Query Q2
Document Inner product Document Inner product

D15 1.4142 D15 1.4142


D12 0.7071 D3 1.1547
D14 0.5774 D7 0.8944
D9 0.5000 D12 0.7071
D11 0.5000 D4 0.5774
D1 0.4472 D8 0.5774
D2 0 D10 0.5774
D3 0 D14 0.5774
D4 0 D9 0.5000
D5 0 D11 0.5000
D6 0 D1 0.4472
D7 0 D2 0
D8 0 D5 0
D10 0 D6 0
D13 0 D13 0

Table 2.2 Document ranking by similarity with Q1 and Q2 (example 2.1)

This example shows an evident disadvantage of the vector space model: the only
documents retrieved are those that contain index terms appearing in the query
(lexical matching). Documents that are semantically related to the query but do not
contain its index terms (e.g. synonyms) are not retrieved.

Example 2.2 [8] This example illustrates the application of global weighting, the IDF
component. The same document collection and the same queries Q1 and Q2 are
used, where the documents use the tfn weighting function and the queries use the
tfx weighting function. The TF component represents the number of occurrences of
an index term in a given document. Furthermore, the IDF component is computed by
formula 2.3 and the normalization component by formula 2.4. The IDF components
of the index terms are shown in Table 2.3.

Term            dj (number of documents in which the term occurs)   IDF component

text 4 1.3218
mining 3 1.6094
clustering 4 1.3218
classification 2 2.0149
retrieval 4 1.3218
analysis 3 1.6094
information 3 1.6094
linear 2 2.0149
algebra 5 1.0986
matrix 4 1.3218
application 2 2.0149
document 2 2.0149
vector 3 1.6094
space 3 1.6094
data 4 1.3218
algorithm 2 2.0149

Table 2.3 IDF components of index terms (example 2.2)

The final document-term matrix A is obtained by multiplying the rows of matrix F
by the corresponding IDF components and normalizing the columns. The query
vectors q1 and q2 from example 2.1 are multiplied by the corresponding IDF
components, so the resulting vector representations of the queries are:

q1 = (0, 1.6094, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1.3218, 0)T,


q2 = (0, 1.6094, 0, 0, 0, 0, 0, 2.0149, 1.0986, 0, 0, 0, 0, 0, 1.3218, 0)T.

The retrieved documents ranked by relevance for queries q1 and q2 are shown
in Table 2.4.

Query Q1 Query Q2
Document Inner product Document Inner product

D15 2.0826 D15 2.0826


D12 0.9346 D3 1.9887
D14 0.8939 D7 1.4248
D1 0.7512 D12 0.9346
D9 0.5485 D14 0.8939
D11 0.5485 D1 0.7512
D2 0 D9 0.5485
D3 0 D11 0.5485
D4 0 D8 0.4776
D5 0 D10 0.4776
D6 0 D4 0.4557
D7 0 D2 0
D8 0 D5 0
D10 0 D6 0
D13 0 D13 0

Table 2.4 Document ranking by similarity with Q1 and Q2 (example 2.2)

In this example we notice a slightly different ranking of the retrieved documents
than in example 2.1. However, the same documents have non-zero similarity, and
the top few documents are ranked in the same order as in example 2.1.

2.3. EVALUATION

In this section, criteria and techniques for information retrieval evaluation are


described. More about evaluation criteria can be found in [17], [18]. Evaluation is
usually performed on standard test document collections, constructed by experts
from different areas, which facilitates the comparison of different information retrieval
methods. Test collections are composed of:

 document collections (science article abstracts, reports or news articles)


 queries set for document collections
estimated relevance of documents for each query in the query set (relevance
judgments)

As in the case of information retrieval systems, the evaluation of document


classifiers is typically conducted experimentally, rather than analytically. The reason
for this tendency is that, in order to evaluate a system analytically (e.g. proving that
the system is correct and complete) we always need a formal specification of the
problem that the system is trying to solve (e.g. with respect to what correctness and
completeness are defined), and the central notion of document classification (namely,
that of relevance of a document to a category) is, due to its subjective character,
inherently non-formalisable. The experimental evaluation of classifiers, rather than
concentrating on issues of efficiency, usually tries to evaluate the effectiveness of a
classifier, i.e. its capability of taking the right categorization decisions. The main
reasons for this bias are that:

 Efficiency is a notion dependent on the hw/sw technology used. Once this


technology evolves, the results of experiments aimed at establishing efficiency
are no longer valid. This does not happen for effectiveness, as any experiment
aimed at measuring effectiveness can be replicated, with identical results, on
any different or future hw/sw platform;
 Effectiveness is really a measure of how the system is good at tackling the
central notion of classification, that of relevance of a document to a category.

Classification effectiveness is measured in terms of the classic IR notions of
precision p and recall r. To define these measures precisely, let A denote the set of
documents retrieved for a given query, R the set of relevant documents, and Rα the
intersection of the two sets (Rα = A ∩ R); |A|, |R| and |Rα| denote the cardinalities of
these sets. Recall r and precision p are computed as follows, where in formulas 2.7
and 2.9 the contingency counts are summed over the n queries:

r = |Rα| / |R|   (2.6)        r = Σ(i=1..n) TPi / Σ(i=1..n) (TPi + FNi)   (2.7)

p = |Rα| / |A|   (2.8)        p = Σ(i=1..n) TPi / Σ(i=1..n) (TPi + FPi)   (2.9)

TP (true positive), FP (false positive), TN (true negative) and FN (false negative) are
described in Table 2.5.

Documents di, 1 ≤ i ≤ n    Relevant: YES    Relevant: NO

Retrieved: YES             TPi              FPi
Retrieved: NO              FNi              TNi

Table 2.5 The contingency table for one query
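A sketch of formulas 2.7 and 2.9, summing the contingency counts of Table 2.5 over several queries (the counts below are invented for illustration):

```python
def micro_precision_recall(counts):
    # counts: one (TP_i, FP_i, FN_i) tuple per query, i = 1, ..., n
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    p = tp / (tp + fp) if tp + fp else 0.0  # formula (2.9)
    r = tp / (tp + fn) if tp + fn else 0.0  # formula (2.7)
    return p, r

# Two queries: (TP, FP, FN)
p, r = micro_precision_recall([(6, 0, 4), (0, 1, 1)])
print(round(p, 4), round(r, 4))
# → 0.8571 0.5455
```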

Generally, as recall grows, precision decreases, and vice versa. These measures
are closely related to each other and are computed at the same time. Moreover, the
set A is never returned to the user at once; rather, a ranked document list ordered by
decreasing similarity is returned. As the user examines this list in top-down order,
precision and recall vary. Good insight into retrieval quality can be acquired by
averaging precision at different recall levels. Usually, average precision is computed
at 11 standard recall levels: 0%, 10%, 20%, ..., 100%. Let rk, k = 0, 1, ..., 10, denote
the k-th standard recall level. As the user iterates through the list of documents
retrieved for a query, recall usually does not fall exactly on the standard levels, so
the precision at a given level, P(ri) for i = 0, 1, ..., 9, is computed by the following
formula:

P(ri) = max{ P(r) : ri ≤ r ≤ ri+1 },   i = 0, 1, ..., 9   (2.10)

The precision at the 100% recall level equals the precision value at the point


where 100% recall is achieved; if 100% recall is never achieved, the precision at the
100% recall level is 0. The mean average precision is then computed as the
arithmetic mean over the 11 standard recall levels:

P = (1/11) Σ(i=0..10) P(ri)   (2.11)

Effectiveness can also be measured as the value of the Fα function, for some
0 ≤ α ≤ 1, where Fα is defined as follows:

Fα = 1 / ( α · (1/p) + (1 − α) · (1/r) )   (2.12)

In this formula, α may be seen as the relative degree of importance attributed to p
and r: if α = 1, Fα coincides with p; if α = 0, Fα coincides with r. Usually a value of
α = 0.5 is used, which attributes equal importance to p and r; for reasons we will not
go into here, this is usually called F1 rather than F0.5 (see [17], [18] for details). As
shown in [32], for a given classifier Φ, its breakeven value is always less than or
equal to its F1 value.
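Formula 2.12 and its two limiting cases can be sketched as:

```python
def f_alpha(p, r, alpha=0.5):
    # F_alpha = 1 / (alpha * (1/p) + (1 - alpha) * (1/r)), formula (2.12)
    if p == 0 or r == 0:
        return 0.0
    return 1.0 / (alpha / p + (1.0 - alpha) / r)

p, r = 0.8, 0.5
print(round(f_alpha(p, r), 4))             # alpha = 0.5, the usual F1
print(round(f_alpha(p, r, alpha=1.0), 4))  # coincides with precision p
print(round(f_alpha(p, r, alpha=0.0), 4))  # coincides with recall r
# prints 0.6154, then 0.8, then 0.5
```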

2.4. EXAMPLE OF DOCUMENTS AND QUERIES

Example 2.3 This example shows documents and queries from the MEDLINE and
CRANFIELD collections. The label .I marks a new document, and the number after
this label is the ordinal number of the document. Furthermore, the label .W marks
the beginning of the document text. In Figure 2.3 we see that the document is
relatively short and contains very specific terms such as fetal, plasma, glucose, fatty,
acids. In Figure 2.4 we can also see that the queries are very short and clear and
contain specific terms. The fifth query contains the term fetus, while the first
document, which is relevant for the fifth query, contains the term fetal (lexically
different from the term fetus, but with the same meaning).

.I 1
.W
correlation between maternal and fetal plasma levels of glucose and free fatty
acids .
correlation coefficients have been determined between the levels of
glucose and ffa in maternal and fetal plasma collected at delivery .
significant correlations were obtained between the maternal and fetal
glucose levels and the maternal and fetal ffa levels . from the size of
the correlation coefficients and the slopes of regression lines it
appears that the fetal plasma glucose level at delivery is very strongly
dependent upon the maternal level whereas the fetal ffa level at
delivery is only slightly dependent upon the maternal level .

Figure 2.3 The first document from MEDLINE document collection

.I 1
.W
the crystalline lens in vertebrates, including humans.
.I 3
.W
electron microscopy of lung or bronchi.
.I 4
.W
tissue culture of lung or bronchial neoplasms.
.I 5
.W
the crossing of fatty acids through the placental barrier. normal
fatty acid levels in placenta and fetus.
.I 10
.W
neoplasm immunology.

Figure 2.4 Some of the queries for MEDLINE document collection

In Figure 2.5 we see the first document, and in Figure 2.6 some of the queries, from
the CRANFIELD document collection. The label .T marks the title and the label .A
the author of the document. Unlike MEDLINE, the CRANFIELD documents and
queries contain less specific terms.

.I 1
.T
experimental investigation of the aerodynamics of a
wing in a slipstream .
.A
brenckman,m.
.W
experimental investigation of the aerodynamics of a
wing in a slipstream .
an experimental study of a wing in a propeller slipstream was
made in order to determine the spanwise distribution of the lift
increase due to slipstream at different angles of attack of the wing
and at different free stream to slipstream velocity ratios . the
results were intended in part as an evaluation basis for different
theoretical treatments of this problem .
the comparative span loading curves, together with
supporting evidence, showed that a substantial part of the lift increment
produced by the slipstream was due to a /destalling/ or
boundary-layer-control effect . the integrated remaining lift
increment, after subtracting this destalling lift, was found to agree
well with a potential flow theory .
an empirical evaluation of the destalling effects was made for
the specific configuration of the experiment .

Figure 2.5 The first document from CRANFIELD document collection

.I 001
.W
what similarity laws must be obeyed when constructing aeroelastic models
of heated high speed aircraft .
.I 002
.W
what are the structural and aeroelastic problems associated with flight
of high speed aircraft .
.I 004
.W
what problems of heat conduction in composite slabs have been solved so
far .
.I 008
.W
can a criterion be developed to show empirically the validity of flow
solutions for chemically reacting gas mixtures based on the simplifying
assumption of instantaneous local chemical equilibrium .

Figure 2.6 Some of the queries for CRANFIELD document collection

3. CLUSTERING

Clustering is the task of organizing a set of objects into meaningful groups. These
groups can be disjoint, overlapping, or organized in some hierarchical fashion. The
key element of clustering is the notion that the discovered groups are meaningful.
This definition is intentionally vague, as what constitutes meaningful is, to a large
extent, application dependent. In some applications this may translate to groups in
which the pairwise similarity between objects of the same group is maximized and
the pairwise similarity between objects of different groups is minimized. In other
applications this may translate to groups that contain objects sharing some key
characteristics, although their overall similarity is not the highest.

Clustering is the unsupervised classification of patterns (observations, data items,


or feature vectors) into groups (clusters). The clustering problem has been addressed
in many contexts and by researchers in many disciplines; this reflects its broad
appeal and usefulness as one of the steps in exploratory data analysis. However,
clustering is a difficult combinatorial problem, and differences in assumptions and
contexts in different communities have made the transfer of useful generic concepts
and methodologies slow to occur.

Typical pattern clustering activity involves the following steps [12]:

(1) pattern representation (optionally including feature extraction and/or selection),


(2) definition of a pattern proximity measure appropriate to the data domain,
(3) clustering or grouping,
(4) data abstraction (if needed),
(5) assessment of output (if needed).

Figure 3.1 depicts a typical sequencing of the first three of these steps, including
a feedback path where the grouping process output could affect subsequent feature
extraction and similarity computations.

Figure 3.1 Process of data clustering

Pattern representation refers to the number of classes, the number of


available patterns, and the number, type, and scale of the features available to the
clustering algorithm. Some of this information may not be controllable by the
practitioner. Feature selection is the process of identifying the most effective subset
of the original features to use in clustering. Feature extraction is the use of one or
more transformations of the input features to produce new salient features. Either or
both of these techniques can be used to obtain an appropriate set of features to use
in clustering.

Pattern proximity is usually measured by a distance function defined on pairs
of patterns. A variety of distance measures are in use in the various communities. A
simple distance measure like Euclidean distance can often be used to reflect
dissimilarity between two patterns, whereas other similarity measures can be used to
characterize the conceptual similarity between patterns.
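The contrast between a dissimilarity measure (Euclidean distance) and a similarity measure (cosine) can be sketched in a few lines of Python; the two-dimensional patterns below are invented purely for illustration:

```python
import math

def euclidean(a, b):
    # Dissimilarity: straight-line distance between two pattern vectors
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine(a, b):
    # Similarity: cosine of the angle between two pattern vectors (1 = same direction)
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

p1 = [1.0, 2.0]
p2 = [2.0, 4.0]  # same direction as p1, twice the magnitude
print(euclidean(p1, p2))  # nonzero: the patterns are not identical points
print(cosine(p1, p2))     # close to 1: the patterns are conceptually aligned
```

Two patterns pointing in the same direction but with different magnitudes are far apart in the Euclidean sense yet maximally similar in the cosine sense, which is why the cosine measure is the usual choice for text.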

The grouping step can be performed in a number of ways. The output cluster
(or clusters) can be hard (a partition of the data into groups) or fuzzy (where each
pattern has a variable degree of membership in each of the output clusters).
Hierarchical clustering algorithms produce a nested series of partitions based on a
criterion for merging or splitting clusters based on similarity. Partitional clustering
algorithms identify the partition that optimizes (usually locally) a clustering criterion.
Additional techniques for the grouping operation include probabilistic and graph-
theoretic clustering methods.

Data abstraction is the process of extracting a simple and compact
representation of a data set. Here, simplicity is either from the perspective of
automatic analysis (so that a machine can perform further processing efficiently) or it
is human-oriented (so that the representation obtained is easy to comprehend and
intuitively appealing). In the clustering context, a typical data abstraction is a compact
description of each cluster, usually in terms of cluster prototypes or representative
patterns such as the centroid.

Different approaches to clustering data can be described with the help of the
hierarchy shown in Figure 3.2. At the top level, there is a distinction between
hierarchical and partitional approaches (hierarchical methods produce a nested
series of partitions, while partitional methods produce only one).

Figure 3.2 Taxonomy of clustering approaches

One well-known clustering system is G-means (clustering in a ping-pong style),
described in [10]. Concept indexing is based on the k-means clustering algorithm;
therefore, the k-means algorithm is precisely defined and described in the next
chapter (dimensionality reduction, concept decomposition).

4. DIMENSIONALITY REDUCTION

Different techniques for dimensionality reduction in the vector space model have been
developed. There are many motives for dimensionality reduction, such as reducing the
memory space needed for document representation, improving information retrieval or
classification performance, eliminating noise and redundancy in document
representation, etc. Although dimensionality reduction means information reduction, it
often results in more efficient information retrieval and classification, which shall
be confirmed later in this thesis. If the dimensionality of the original vector space
is equal to the index term count m, dimensionality reduction to m̃ dimensions can be
performed in the following two manners:

1. term selection (feature selection): select a subset of m̃ << m of the original
   terms,
2. term extraction (feature construction): construct m̃ << m new features from the
   original terms.

Dimensionality reduction methods can be supervised, if they use information
about the assignment of documents to classes, or unsupervised, if they do not use
this information. In dimensionality reduction by term selection, the index term set
T = {t1, …, tm} is reduced to a subset T̃ (T̃ ⊂ T). The objective is to select terms
so that the reduction in information retrieval performance is minimal. On the other
hand, dimensionality reduction by term extraction creates a new set T̃ which is not a
subset of T, and in general the terms from T̃ will not match the terms from T. This
approach is also called reparameterization (the number of new parameters is less
than the number of old parameters), and its aim is to overcome the synonymy and
polysemy problems. Dimensionality reduction techniques map document representations
that are close to each other in the original space into vectors in the reduced space
that are closer to each other than in the original space. This facilitates the
retrieval of documents that are relevant for a certain query even if they do not
contain the query's index terms.

This thesis is based on term extraction information retrieval methods of latent
semantic indexing and concept indexing. The method of LSI was introduced in
1990 [33] and improved in 1995 [30]. It represents documents as approximations and
tends to cluster documents on similar topics even if their term profiles are somewhat
different. This approximate representation is accomplished by using a low-rank
singular value decomposition (SVD) approximation of the term-document matrix.
Although the LSI method is empirically successful, it suffers from the lack of an
interpretation for the low-rank approximation and, consequently, the lack of controls for
accomplishing specific tasks in information retrieval. The explanation of Latent
Semantic Indexing efficiency in terms of multivariate analysis is provided in [3], [15],
[16]. A method by Dhillon and Modha [7] uses centroids of clusters created by the
spherical k-means algorithm or so-called concept decomposition (CD) for lowering
the rank of the term-document matrix. Applying this method, the space on which the
term-document matrix is projected is more interpretable. Namely, it is a space spanned
by centroids of clusters. The information retrieval technique using concept
decomposition is called concept indexing (CI). Furthermore, the concept
decomposition method is computationally more efficient and requires less memory
than LSI.

4.1. LATENT SEMANTIC INDEXING

Let the m × n matrix A = [aij] be the term-document matrix. Then aij is the weight of
the i-th term in the j-th document. The standard procedure is to normalize the
columns of the matrix to be of unit norm. The term-document matrix has an important
property of being sparse, i.e. most of its elements are zeros.

A query has the same form as a document: it is a vector whose i-th entry is the
frequency of the i-th term in the query. We never normalize the query vector,
because normalization has no effect on document ranking. A common measure of
similarity between the query and a document is the cosine of the angle between them.
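The cosine ranking of documents against a query can be sketched in plain Python. The three-term, three-document collection below is invented for illustration; with unit-norm columns, the cosine score reduces to a simple dot product:

```python
import math

# Toy term-document matrix: rows are terms, columns are documents
# (the values are invented for illustration)
A = [[1.0, 0.0, 1.0],
     [1.0, 1.0, 0.0],
     [0.0, 1.0, 1.0]]

def normalize_columns(M):
    # Scale every column to unit Euclidean norm, as done before querying
    cols = list(zip(*M))
    cols = [[x / math.sqrt(sum(v * v for v in col)) for x in col] for col in cols]
    return [list(row) for row in zip(*cols)]

A = normalize_columns(A)
q = [1.0, 0.0, 0.0]  # a query containing only the first term

# s = q^T A: the j-th entry is the cosine score of document j against q
s = [sum(q[i] * A[i][j] for i in range(len(q))) for j in range(len(A[0]))]
ranking = sorted(range(len(s)), key=lambda j: -s[j])
print(ranking)  # documents containing term 1 rank ahead of the one that lacks it
```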

In order to rank documents according to their relevance to the query, we compute
s = q^T A, where q is the vector of the query and the j-th entry of s represents the
relevance score of the j-th document. The LSI method is just a variation of the
vector space model. The fundamental mathematical result that supports LSI [2] is
that for any m × n matrix A, the following singular value decomposition exists:

A = U Σ V^T (4.1)

where U is the m × m orthogonal matrix, V is the n × n orthogonal matrix and Σ is the
m × n diagonal matrix:

Σ = diag(σ1, ..., σp) (4.2)

where p = min{m, n} and σ1 ≥ σ2 ≥ ...≥ σp ≥ 0. The σi are the singular values and ui
and vi are the i-th left singular vector and the i-th right singular vector respectively.
The second fundamental result [29] is the theorem by Eckart and Young, which
states that the distance in the Frobenius norm between A and its k-rank
approximation is minimized by the approximation Ak. Here

Ak = Uk Σk Vk^T (4.3)

where Uk is the m × k matrix whose columns are the first k columns of U, Vk is the
n × k matrix whose columns are the first k columns of V, and Σk is the k × k diagonal
matrix whose diagonal elements are the k largest singular values of A. More precisely,

||A − Ak||F = min_{rank(X) ≤ k} ||A − X||F = (σk+1² + … + σr²)^(1/2), r = rank(A) (4.4)

We call Ak the truncated SVD of A, and the space spanned by the columns of Uk the
k-dimensional LSI subspace. Thus, there is no better rank-k approximation of matrix
A than Ak in the Frobenius norm. Figure 4.1 depicts the process of singular value
decomposition.

[Figure: the factorization A (m × n) = U (m × m) · Σ (m × n) · V^T (n × n), with the
leading blocks Uk (m × k), Σk (k × k) and Vk^T (k × n) highlighted as the truncated
factors; rows of U correspond to term vectors and columns of V^T to document vectors]

Figure 4.1 Reduced singular value decomposition (truncated SVD)
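In practice the truncated SVD is computed with a library routine; the following pure-Python power iteration is only a sketch of how the leading (k = 1) singular triplet σ1, u1, v1 can be obtained, applied to a toy 2 × 2 matrix invented for illustration:

```python
import math

def matvec(M, v):
    return [sum(row[i] * v[i] for i in range(len(v))) for row in M]

def transpose(M):
    return [list(col) for col in zip(*M)]

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def dominant_singular_triplet(A, iters=200):
    """Power iteration on A^T A: returns (sigma1, u1, v1) such that
    sigma1 * u1 * v1^T is the best rank-1 approximation of A."""
    At = transpose(A)
    v = [1.0] * len(A[0])
    for _ in range(iters):
        w = matvec(At, matvec(A, v))  # one application of A^T A
        v = [x / norm(w) for x in w]  # renormalize to keep the iterate bounded
    Av = matvec(A, v)
    sigma = norm(Av)                  # sigma1 = ||A v1||
    u = [x / sigma for x in Av]
    return sigma, u, v

A = [[3.0, 0.0],
     [0.0, 1.0]]
sigma1, u1, v1 = dominant_singular_triplet(A)
print(round(sigma1, 6))  # 3.0, the largest singular value of A
```

The full rank-k truncation repeats this for k triplets (deflating A each time) or, far more efficiently, calls a LAPACK-style SVD routine.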

The ranking of documents, according to their relevance for a query using LSI
method, is executed by calculating the score vector

s = q^T Uk Σk Vk^T (4.5)

LSI Algorithm:

1. Create the term-document matrix A and the vector of the query q.
2. Use the singular value decomposition of the term-document matrix to create a
   rank-k approximation Ak according to formula (4.3).
3. Rank the documents according to their relevance to the query according to
   formula (4.5).

4.2. CONCEPT INDEXING

Concept indexing is an indexing method for dimensionality reduction in the vector
space model using concept decomposition (CD) of the term-document matrix. CD of the
m × n term-document matrix A is an approximation by another matrix which provides a
k-dimensional representation (k << m) of the document collection. The CD algorithm
is performed in two basic steps: the first step is clustering the documents into k
clusters, and the second step is projecting the documents onto the cluster centroids
according to a least-squares approximation. Dhillon and Modha in [7] use the
spherical k-means clustering algorithm, a variation of the basic k-means clustering
algorithm which uses unit-norm document representations.

Cluster centroids are also called concept vectors. Concept vectors represent the
concepts of the clusters, expressed through the index terms, and can be used as a
model for classifying documents that are later folded into the document collection.
One of the main advantages of CI over LSI is the interpretability of the concept
vectors, because they are local, unlike LSI's singular vectors, which are global and
not interpretable. Furthermore, CI is less complex and uses less memory than LSI [7].
While LSI is theoretically well founded, CI has no theoretical baseline.

The CI method has two different versions: unsupervised and supervised. The
unsupervised method is used on document collections that have not been classified
by experts, while on document collections that have been classified by experts both
the supervised and the unsupervised method can be used. The supervised method
simply skips the first step of clustering the documents into k clusters; the cluster
centroids are formed on the basis of the experts' classification.

CI's target is to approximate each document vector by a linear combination of
concept vectors. Define the concept matrix as the m × k matrix whose j-th column is
the concept vector cj, that is

Ck = [c1, c2, …, ck]. (4.6)

If we assume linear independence of the concept vectors, then it follows that the
concept matrix has rank k. Now we define the concept decomposition Dk of the
term-document matrix A as the least-squares approximation of A on the column
space of the concept matrix Ck. The concept decomposition is the m × n matrix

Dk = Ck Z* (4.7)

where Z* is the solution of the least-squares problem ([28])

Z* = arg min_Z ||A − Ck Z||F (4.8)

that is

Z* = (Ck^T Ck)^{-1} Ck^T A (4.9)

In this thesis two types of CI algorithms are used: one is spherical k-means
and the other is fuzzy k-means. In the following subsections both types are described.

Concept decomposition algorithm:

1) Compute the document cluster centroids ci, i = 1, 2, …, k by using one of the
   clustering algorithms (spherical or fuzzy k-means) from the following
   subsections.
2) Form the concept matrix Ck = [c1 c2 … ck].
3) Compute the document representation matrix Z* by using formula (4.9).
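Given the centroids from step 1, steps 2 and 3 can be sketched in plain Python; the least-squares solve uses the normal equations (4.9) with a small Gauss-Jordan elimination. The concept matrix and term-document matrix below are invented for illustration:

```python
def transpose(M):
    return [list(row) for row in zip(*M)]

def matmul(X, Y):
    return [[sum(X[i][t] * Y[t][j] for t in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def solve(M, B):
    """Solve the small k x k system M Z = B by Gauss-Jordan elimination."""
    n = len(M)
    aug = [M[i][:] + B[i][:] for i in range(n)]   # augmented matrix [M | B]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[piv] = aug[piv], aug[col]   # partial pivoting
        aug[col] = [x / aug[col][col] for x in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def concept_decomposition(C, A):
    """Z* = (C^T C)^{-1} C^T A, the least-squares projection of A onto span(C)."""
    Ct = transpose(C)
    return solve(matmul(Ct, C), matmul(Ct, A))

# Hypothetical 3-term space: two concept vectors (columns of C), two documents
C = [[1.0, 0.0],
     [0.0, 1.0],
     [0.0, 0.0]]
A = [[1.0, 0.0],
     [0.0, 1.0],
     [1.0, 1.0]]
Z = concept_decomposition(C, A)
print(Z)  # k-dimensional coordinates of each document in concept space
```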

4.2.1. SPHERICAL K-MEANS

Suppose we are given n document vectors a1, a2, ..., an in the nonnegative orthant
of R^m. Let π1, π2, ..., πk denote a partitioning of the document vectors into k
disjoint clusters such that ([7])

π1 ∪ π2 ∪ … ∪ πk = {a1, a2, ..., an} and πj ∩ πl = ∅ if j ≠ l. (4.10)

For each fixed 1 ≤ j ≤ k, the mean vector or the centroid of the document vectors
contained in the cluster πj is

mj = (1/nj) Σ_{a∈πj} a (4.11)

where nj is the number of document vectors in πj . Note that the mean vector mj need
not have a unit norm; we can capture its direction by writing the corresponding
concept vector as

cj = mj / ||mj|| (4.12)

The concept vector cj has the following important property. For any unit vector
z in R^m, we have from the Cauchy-Schwarz inequality that

Σ_{a∈πj} a^T z ≤ Σ_{a∈πj} a^T cj (4.13)

Thus, the concept vector may be thought of as the vector that is closest in
cosine similarity (in an average sense) to all the document vectors in the cluster π j.
Motivated by (4.13), we measure the "coherence" or "quality" of each cluster πj,
1 ≤ j ≤ k, as

Σ_{a∈πj} a^T cj (4.14)

Observe that if all document vectors in a cluster are identical, then the average
coherence of that cluster will have the highest possible value of 1. On the other
hand, if the document vectors in a cluster vary widely, then the average coherence
will be small, that is, close to 0. Since Σ_{a∈πj} a = nj mj and ||cj|| = 1, we have
that

Σ_{a∈πj} a^T cj = nj mj^T cj = nj ||mj|| = || Σ_{a∈πj} a || (4.15)

This rewriting yields the remarkably simple intuition that the quality of each
cluster πj is measured by the L2 norm of the sum of the document vectors in that
cluster. We measure the quality of any given partitioning {πj}_{j=1}^{k} using the
following objective function:

Q({πj}_{j=1}^{k}) = Σ_{j=1}^{k} Σ_{a∈πj} a^T cj (4.16)

Intuitively, the objective function measures the combined coherence of all the
k clusters. Such an objective function has also been proposed and studied
theoretically in the context of market segmentation problems (Kleinberg et al., 1998).

Spherical k-means algorithm:

1) Initialize clustering. Start with some initial partitioning of the document
   vectors, namely {πj(0)}_{j=1}^{k}. Let {cj(0)}_{j=1}^{k} be the concept vectors
   of the associated partitioning. Set the iteration count t to 0.

2) For each document vector ai, 1 ≤ i ≤ n, find the concept vector closest in
   cosine similarity to ai. Now, compute the new partitioning {πj(t+1)}_{j=1}^{k}
   induced by the old concept vectors {cj(t)}_{j=1}^{k}:

   πj(t+1) = {a ∈ {ai}_{i=1}^{n} : a^T cj(t) ≥ a^T cl(t), 1 ≤ l ≤ k, l ≠ j},
   1 ≤ j ≤ k (4.17)

   In words, πj(t+1) is the set of all document vectors that are closest to the
   concept vector cj(t). If it happens that some document vector is simultaneously
   closest to more than one concept vector, then it is randomly assigned to one
   of the clusters. Clusters defined using (4.17) are known as Voronoi or
   Dirichlet partitions.

3) Compute the new concept vectors corresponding to the partitioning computed
   in (4.17):

   cj(t+1) = mj(t+1) / ||mj(t+1)||, 1 ≤ j ≤ k (4.18)

   where mj(t+1) denotes the centroid or the mean of the document vectors in
   cluster πj(t+1).

4) If some "stopping criterion" is met, then set πj(F) = πj(t+1) and
   cj(F) = cj(t+1) for 1 ≤ j ≤ k, and exit. Otherwise, increment t by 1, and go
   to step 2 above. An example of a stopping criterion is: stop if

Q({πj(t+1)}_{j=1}^{k}) − Q({πj(t)}_{j=1}^{k}) ≤ ε, (4.19)

for some suitably chosen ε ≥ 0. In words, stop if the "change" in the objective
function after an iteration of the algorithm is less than a certain threshold. It
can be shown that the spherical k-means algorithm outlined above never decreases
the value of the objective function.

Like any other gradient-ascent scheme, the spherical k-means algorithm is
prone to local maxima. A careful selection of the initial partitions {πj(0)}_{j=1}^{k}
is important. One can either (a) randomly assign each document to one of the k
clusters, (b) first compute the concept vector for the entire document collection and
randomly perturb this vector to get k starting concept vectors, or (c) try several
initial clusterings and select the best in terms of the largest objective function.
In our implementation we use strategy (b).
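The algorithm can be sketched in plain Python. For reproducibility this toy example uses four two-dimensional unit vectors and fixed seed concept vectors instead of a random initialization:

```python
import math

def unit(v):
    # Normalize a vector to unit Euclidean norm
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def spherical_kmeans(docs, concepts, iters=20):
    """docs and concepts are unit-norm vectors; returns final concepts/partition."""
    k = len(concepts)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # Step 2: assign each document to the concept of highest cosine similarity
        clusters = [[] for _ in range(k)]
        for d in docs:
            j = max(range(k), key=lambda c: dot(d, concepts[c]))
            clusters[j].append(d)
        # Step 3: new concept vector = normalized mean (centroid direction)
        for j, cl in enumerate(clusters):
            if cl:
                mean = [sum(col) / len(cl) for col in zip(*cl)]
                concepts[j] = unit(mean)
    return concepts, clusters

# Two toy groups of documents, one along the x and one along the y direction
docs = [unit(v) for v in ([1.0, 0.1], [1.0, 0.2], [0.1, 1.0], [0.2, 1.0])]
concepts, clusters = spherical_kmeans(docs, [unit([1.0, 0.0]), unit([0.0, 1.0])])
print([len(c) for c in clusters])  # the two directions end up in separate clusters
```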

4.2.2. FUZZY K-MEANS

The fuzzy k-means algorithm (FKM) (see [9], [26], [27], [34]) generalizes the
classical or hard k-means algorithm. The goal of the k-means algorithm is to cluster
n objects (here documents) into k clusters and to find the k mean vectors of the
clusters (centroids). In the context of the vector space model for information
retrieval we call these mean vectors concepts. The spherical k-means algorithm used
in [7] is just a variation of the hard k-means algorithm which uses the fact that
the document vectors (and the concept vectors) are of unit norm.

As opposed to the hard k-means algorithm, which allows a document to belong to
only one cluster, FKM allows a document to partially belong to multiple clusters.
FKM seeks a minimum of a heuristic global cost function

J_fuzz = Σ_{i=1}^{k} Σ_{j=1}^{n} μij^b ||aj − ci||², (4.20)

where aj, j = 1, …, n are the document vectors, ci, i = 1, …, k are the concept
vectors, μij is the fuzzy membership degree of document aj in the cluster whose
concept is ci, and b is a weight exponent of the fuzzy membership. In general, the
J_fuzz criterion is minimized when the concept ci is near those points that have a
high fuzzy membership degree for cluster i, i = 1, …, k. By solving the system of
equations ∂J_fuzz/∂ci = 0 and ∂J_fuzz/∂μij = 0 we obtain the stationary point

i 1,, k
1
 ij  1
,
j 1,, n. (4.21)
  b 1
2
k
a j  ci

 2


a  cr

r 1

35
n


j 1
b
ij aj
c i
 n
, i  1,  , k (4.22)

j 1
b
ij

for which the cost function achieves a local minimum. We obtain the concept vectors
using the following iterative procedure:

Fuzzy k-means algorithm:

1. Start with arbitrary concept vectors ci(0), i = 1, …, k. Set the iteration
   index t = 0.

2. Compute the fuzzy membership degrees μij(0) according to formula (4.21).

3. Compute the cost function J_fuzz(0) according to formula (4.20).

4. Compute the new concept vectors ci(t+1) according to formula (4.22).

5. Compute the new fuzzy membership degrees μij(t+1) according to formula (4.21).

6. Compute the new cost function J_fuzz(t+1) according to formula (4.20).

7. If |J_fuzz(t+1) − J_fuzz(t)| ≤ ε for some threshold ε, then stop and return the
   concept vectors; else increment t by 1 and go to step 4.
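The membership update (4.21) and the concept update (4.22) can be transcribed directly into plain Python. The two-dimensional "documents" below are invented for illustration, and the sketch assumes no document ever coincides exactly with a concept vector, so the memberships in (4.21) are well defined:

```python
def dist2(a, b):
    # Squared Euclidean distance ||a - b||^2 used in the J_fuzz cost
    return sum((x - y) ** 2 for x, y in zip(a, b))

def fuzzy_kmeans(docs, concepts, b=2.0, iters=50):
    """Alternate the membership update (4.21) and the concept update (4.22).
    Assumes no document coincides exactly with a concept vector."""
    k, n, dim = len(concepts), len(docs), len(docs[0])
    mu = [[0.0] * n for _ in range(k)]
    for _ in range(iters):
        # Fuzzy membership degrees mu[i][j] of document j in cluster i
        for j in range(n):
            d = [dist2(docs[j], concepts[i]) for i in range(k)]
            for i in range(k):
                mu[i][j] = 1.0 / sum((d[i] / d[r]) ** (1.0 / (b - 1.0))
                                     for r in range(k))
        # Concept vectors as weighted means with weights mu^b
        for i in range(k):
            w = [mu[i][j] ** b for j in range(n)]
            tot = sum(w)
            concepts[i] = [sum(w[j] * docs[j][t] for j in range(n)) / tot
                           for t in range(dim)]
    return concepts, mu

# Two hypothetical well-separated groups of 2-D "documents"
docs = [[0.0, 0.0], [0.0, 1.0], [4.0, 0.0], [4.0, 1.0]]
concepts, mu = fuzzy_kmeans(docs, [[0.0, 0.5], [4.0, 0.5]])
print(round(mu[0][0], 2), round(mu[1][2], 2))  # both close to 1: soft but confident
```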

4.3. LSI AND CI COMPARISON EXAMPLE

Example 4.1 [8] In this example LSI (SVD) and CI using fuzzy k-means (CDFKM) are
compared for the document collection and queries from Example 2.1.

The term-document matrix is created and its columns are normalized to unit
norm. To such a matrix we have applied CDFKM (concept decomposition with
fuzzy k-means, k = 2) and truncated SVD (k = 2). Let the truncated SVD be
U2 Σ2 V2^T and the CDFKM decomposition be C2 Z*. In the truncated SVD, rows of U2
are the approximate (two-dimensional) representations of terms, while rows of V2
are the approximate (two-dimensional) representations of documents. Here we
neglect the Σ2 part, since Σ2 is a diagonal matrix and produces only a scaling of
the axes. In CDFKM, rows of C2 are approximate representations of terms and
columns of Z* are approximate
representations of documents. Coordinates of terms are listed in Table 4.1, while
coordinates of documents and queries are listed in Table 4.2. In Figures 4.2 and
4.3 the images of terms are plotted. From Figure 4.2 we can see that the images of
the two groups of terms, data mining (DM) terms and linear algebra (LA) terms, are
grouped together in the case of truncated SVD. In the case of CDFKM, the two groups
of terms are generally grouped along the axes: along the y axis (and near it) we
have DM terms, and along the x axis we have LA terms. Exceptions are the terms
information and retrieval. Our assumption is that this is because the model was
confused by document D6, which contains these terms and LA terms.

Most of the DM documents do not contain the words data and mining. Such
documents will not be recognized as relevant by the simple term-matching vector
space method. Document D6, relevant for Q2, does not contain any of the terms
contained in the query. In the vector space model, the query has the same form as
the document. Let q be the representation of the query in the vector space model
and q̃ its approximate representation using the truncated SVD.

Term SVD CDFKM
xi yi xi yi
text 0.21 -0.31 0.10 0.43
mining 0.16 -0.29 0.01 0.42
clustering 0.24 -0.41 0.08 0.48
classification 0.12 -0.18 0.00 0.23
retrieval 0.27 -0.20 0.29 0.11
analysis 0.21 -0.11 0.29 0.00
information 0.09 -0.16 0.00 0.32
linear 0.25 -0.41 0.10 0.47
algebra 0.19 0.14 0.20 0.00
matrix 0.50 0.40 0.54 0.00
application 0.37 0.25 0.40 0.00
document 0.29 0.19 0.32 0.00
vector 0.29 0.20 0.32 0.00
space 0.21 -0.09 0.19 0.12
data 0.19 0.14 0.20 0.00
algorithm 0.11 -0.17 0.18 0.07

Table 4.1 Coordinates of the terms by SVD and CDFKM

Document LSI Space CI Space
xi yi xi yi
D1 0.24 -0.35 0.06 0.74
D2 0.24 -0.20 0.38 0.26
D3 0.33 0.26 0.69 -0.14
D4 0.33 0.26 0.69 -0.14
D5 0.11 -0.19 -0.04 0.54
D6 0.34 0.08 0.74 -0.10
D7 0.35 0.22 0.71 -0.08
D8 0.34 0.26 0.71 -0.14
D9 0.22 -0.25 0.38 0.25
D10 0.34 0.26 0.71 -0.14
D11 0.22 -0.31 0.06 0.64
D12 0.19 -0.33 -0.01 0.67
D13 0.13 -0.23 0.11 0.37
D14 0.14 -0.25 -0.08 0.69
D15 0.16 -0.28 -0.05 0.64
Q1 0.22 -0.40 -0.07 0.90
Q2 0.57 -0.03 0.70 0.75

Table 4.2 Coordinates of documents and queries by SVD and CDFKM

[Scatter plot of DM terms and LA terms in the two-dimensional LSI space]

Figure 4.2 Images of terms by LSI

[Scatter plot of DM terms and LA terms in the two-dimensional CI space]

Figure 4.3 Images of terms by CI

Then, the following is satisfied:

q ≈ U2 Σ2 q̃^T, that is, q̃ = q^T U2 Σ2^{-1}.

On the other hand, since documents are represented as columns of
Z* = (Ck^T Ck)^{-1} Ck^T A in CD, the approximate representation of the query by CD
will be q̃ = (Ck^T Ck)^{-1} Ck^T q. In Figures 4.4 and 4.5, images of the approximate
representations of documents and queries are plotted. In the SVD projection, DM
documents form one group, LA documents another, and document D6 is isolated. In the
CD projection, LA documents are grouped; DM documents are somewhat more dispersed,
while document D6 falls in the group of LA documents. Shaded areas represent the
areas of documents relevant for the queries in the cosine similarity sense.

The documents retrieved for query Q1, in descending order of their score for
the term-matching method, are: D15, D12, D14, D9, D11 and D1. Other documents are
not retrieved at all, since their score is 0. So, the term-matching method has
retrieved 6 out of 10 relevant documents. The documents retrieved for Q1 by LSI
are: D1, D11, D12, D9, D15, D2, D14, D13, D5 and D6. The scores of the other
documents are much lower, and we can state that they are not retrieved at all. The
retrieved documents are exactly all the relevant documents. The documents retrieved
for Q1 by CI are: D1, D14, D12, D11, D15, D5, D13, D2 and D9. These are all the
relevant documents except document D6. For query Q2, only document D6 is relevant.
The term-matching method does not retrieve it at all, the LSI method recognizes D6
as the most relevant document (although it does not contain any term from the
query), and the CI method retrieves D6 as the sixth most relevant document.

As a conclusion of this academic example, we can state that the LSI and CI
methods have a similar effect: they cluster documents on a similar topic even if
their term profiles are different. On this example, LSI seems to work better. In
the EXPERIMENTS AND RESULTS section, we compare these two techniques on much
larger document collections to achieve statistically significant comparisons.

[Scatter plot of DM documents, LA documents, document D6 and queries Q1 and Q2 in
the two-dimensional LSI space; shaded areas mark the regions of relevant documents]

Figure 4.4 Images of documents and queries by LSI

[Scatter plot of DM documents, LA documents, document D6 and queries Q1 and Q2 in
the two-dimensional CI space; shaded areas mark the regions of relevant documents]

Figure 4.5 Images of documents and queries by CI

The clustering of the document collection from this example was performed by
our implemented system. In the System implementation section another clustering of
this collection is presented.

4.4. LSI FOLDING-IN ALGORITHM

Document collections on which information retrieval is performed are often
dynamically organized. Adding documents to and removing documents from a document
collection is performed continuously. The best example of a dynamic document
collection is the World Wide Web (WWW). In addition, adding new documents to a
document collection causes new index terms to appear, so it is necessary to update
the index term list. This thesis only considers adding new documents (folding-in)
to a document collection.

Adding new documents to a document collection is not an obstacle in the vector
space model. A problem appears when the document collection is represented in a
space of reduced dimensionality. Namely, the vectors onto which the projection is
performed during dimensionality reduction are acquired using the whole document
collection, and folding in new documents would require recomputing the singular
and concept vectors. That is, of course, not practical. In this thesis only the
folding-in method without recomputation of the transformation matrix for LSI is
presented. The expanded term-document matrix A is acquired after adding new
documents, where:

 A1 represents the initial documents matrix in the initial term space
 A2 represents the new documents matrix in the initial term space

A = [A1 A2] (4.23)

With m we denote the initial term count, with n1 the initial document count and
with n2 the new document count. Then matrix A1 is of size m × n1 and matrix A2 is
of size m × n2. Furthermore, txn(A1) → A1 and txn(A2) → A2 are performed; in other
words, the column vectors of A1 and A2 are normalized.

The truncated SVD is represented as A1k = Uk Σk Vk^T. The query vectors matrix Q
is of size q × m (queries contain only the initial terms).

This algorithm represents documents and queries in the LSI space without
computing coordinates for the new index terms and without correction of the
singular vectors. The algorithm is performed as described below.

Algorithm:

1) Compute the representation DNEW of the new document vectors in the space of
   reduced dimensionality by using the formula DNEW = A2^T Uk Σk^{-1}.

2) Compute the representation Queries of the query vectors in the space of
   reduced dimensionality by using the formula Queries = Q Uk Σk^{-1}.

3) Form the document matrix DALL composed of the vector representations of the
   initial and the new documents in the space of reduced dimensionality, with Vk
   stacked above DNEW:

   DALL = [Vk ; DNEW]
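Step 1 of the algorithm amounts to a matrix product followed by a diagonal scaling. A minimal sketch with hypothetical k = 2 factors (an invented Uk, Σk and one new document, not the thesis implementation):

```python
def transpose(M):
    return [list(row) for row in zip(*M)]

def matmul(X, Y):
    return [[sum(X[i][t] * Y[t][j] for t in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def fold_in(A2, Uk, sigmas):
    """D_NEW = A2^T Uk Sigma_k^{-1}: each row holds the k-dimensional LSI
    coordinates of one new document; the singular vectors are NOT recomputed."""
    P = matmul(transpose(A2), Uk)  # A2^T Uk, size n2 x k
    return [[P[i][j] / sigmas[j] for j in range(len(sigmas))]
            for i in range(len(P))]

# Hypothetical k = 2 factors for a 3-term space (columns of Uk orthonormal)
Uk = [[1.0, 0.0],
      [0.0, 1.0],
      [0.0, 0.0]]
sigmas = [2.0, 1.0]   # diagonal of Sigma_k
A2 = [[2.0],
      [1.0],
      [0.0]]          # one new document in the initial term space
print(fold_in(A2, Uk, sigmas))  # [[1.0, 1.0]]
```

Queries are folded in the same way, which is why the method is cheap but slowly degrades the representation as the collection drifts from the one used to compute the SVD.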

5. SYSTEM IMPLEMENTATION

Our system implements two dimensionality reduction methods, Concept Indexing (CI)
and Latent Semantic Indexing (LSI), with more emphasis on the CI method. Because of
a lack of time, we developed the LSI method in the MATLAB programming language,
while the CI method, which is more detailed and elaborated, was developed within
three main components.

The first component of the CI system is Intel's Math Kernel Library (the BLAS and
LAPACK libraries [11]). Because IR in this thesis is based on the vector space
model, we found Intel's BLAS and LAPACK libraries suitable for the vector space
model implementation. The second component is the core implementation (back-end),
which is written in Microsoft Visual Studio .NET C++, and the third component is a
graphical user interface, GUI (front-end), written in Microsoft Visual Studio .NET
C#. Each system component is described in the following paragraphs.

1) The BLAS and LAPACK libraries are time-honored standards for solving a
large variety of linear algebra problems. The Intel Math Kernel Library (Intel
MKL) contains an implementation of BLAS and LAPACK that is highly
optimized for Intel processors. Intel MKL can enable you to achieve significant
performance improvements over alternative implementations of BLAS and
LAPACK.

BLAS
Basic Linear Algebra Subroutines (BLAS) provide the basic vector and matrix
operations underlying many linear algebra problems. Intel MKL BLAS support
includes:

 BLAS Level 1 - vector-vector operations
 BLAS Level 2 - vector-matrix operations
 BLAS Level 3 - matrix-matrix operations
 Sparse BLAS - an extension of BLAS Level 1

For BLAS Levels 2 and 3, multiple matrix storage schemes are provided. All
BLAS functions within Intel MKL are thread-safe. The parallelized (threaded) BLAS
Level 3 routines provide the performance enhancements of multiprocessing without
requiring changes to the application.

Sparse BLAS
Sparse BLAS is a set of functions that perform a number of common vector
operations on sparse vectors stored in compressed form. Sparse vectors are
those in which the majority of elements are zeros. Large savings in computation
time and memory can be achieved with sparse BLAS routines and functions that have
been specially implemented to take advantage of vector sparsity.

LAPACK
Intel MKL includes Linear Algebra Package (LAPACK) routines that are used
for solving:

 Linear equations
 Eigenvalue problems
 Least-squares problems
 Singular value problems

LAPACK routines support both real and complex data. Routines are supported
for systems of equations with the following types of matrices: general, banded,
symmetric or Hermitian, triangular, and tridiagonal. The LAPACK routines
within Intel MKL provide multiple matrix storage schemes. LAPACK routines
are available with a FORTRAN interface.

2) The back-end is composed of two parts. The first part is the core
   implementation of the concept indexing functionalities using Intel's MKL
   library (written in unmanaged C++). Because we decided to use C# for the GUI,
   it was necessary to bridge the core implementation from unmanaged to managed
   code. The second part of the back-end is therefore a wrapper class, written in
   managed C++, which is used as a mediator between the core back-end and the GUI.

3) The front-end is written in the C# programming language, because we found C#
   a very suitable language for visualization. The GUI contains three different
   tab-windows: Single test, Batch testing and Query testing.

Figure 5.1 Single test tab-window screenshot

In the Single test window (see Figure 5.1) we can perform the CI algorithm step by
step or to the end, with options for adjusting many algorithm parameters, such as
the clustering algorithm type (spherical or fuzzy k-means), the number of concept
vectors, the type of concept vector initialization (random values, random documents
or perturbed centroids) and the initial percentage of documents to select from the
document collection. We have also implemented a 2D example mode in which we are
able to perform CI on two types of examples.

In the first example (example type-1), on the graph on the right side of this
window, we are able to define documents in the positive quadrant by clicking on a
certain point of the graph (Figure 5.2), and after that to cluster those documents
by running one of the clustering algorithms (Figure 5.3). Blue crosses represent
documents and red crosses represent concept vectors.

Figure 5.2 Documents before clustering (example type-1)

Figure 5.3 Documents after clustering (example type-1)

In the second example (example type-2), on the graph on the right side of this
window, we are able to perform clustering (CD and SVD) on the document collection
defined in Example 2.1 in section 2.2. In other words, Example 4.1 from section 4.3
can be re-performed. Figure 5.4 shows the representations of documents and queries
in the CI space (a clustering different from the one presented in Figure 4.5).

Figure 5.4 Document and query representations in CI space (example type-2)

Figure 5.5 depicts the representations of terms in the CI space (a clustering
different from the one presented in Figure 4.3).

Figure 5.5 Term representations in CI space (example type-2)

Figure 5.6 depicts the representations of documents and queries in the LSI space
(a clustering different from the one presented in Figure 4.4).

Figure 5.6 Document and query representations in LSI space (example type-2)

Figure 5.7 depicts the representations of terms in the LSI space (a clustering
different from the one presented in Figure 4.2).

Figure 5.7 Term representations in LSI space (example type-2)

The Single test tab-window also shows algorithm information such as the document
and term counts, memory consumption and algorithm outputs.

The Batch testing tab-window is composed of three sub tab-windows: Options
(Figure 5.8), Test list and results (Figure 5.9) and Graph (Figure 5.10).

In the Options tab-window we can adjust many parameters for batch runs of the
algorithms: the number of repeats for a given test, the range (or fixed number) of
concept vectors, the clustering algorithm (spherical, fuzzy k-means or both), the type
of concept vector initialization (Random, Random documents or Perturbed
centroids), the minimum and maximum term frequencies, the range (or fixed value)
of the initial documents percentage, and the type of query matching condition. After
adjusting the parameters for batch testing, we select Start Batch in the Batch menu.
Batch testing can be paused or stopped at any time, because it is implemented as a
separate thread.

Figure 5.8 Batch test tab-window (Options) screenshot

Batch results are presented in the Test list and results tab-window. Each test
is represented as one record (row) of the table. The columns of the table represent
test attributes such as the number of repeats, the clustering algorithm (spherical or
fuzzy k-means), the type of concept vector initialization, the number of concept
vectors, the number of iterations for CI, the average concept vector dot product, the
decomposition error, mean average precision (MAP) and the F1 measure, and the
algorithm running time.

Figure 5.9 Batch test tab-window (Test list and results) screenshot

In the Graph tab-window we define the parameters for visualizing batch tests.
We can choose the quantities plotted on the x and y axes. On the x axis we can plot
the number of concept vectors, the initial document percentage or the fuzzy
exponent b. We can define two y axes and observe the following measures: average
concept vector dot product, decomposition error, number of iterations, test duration,
memory consumption, recall (min, max and avg), precision (min, max and avg), MAP
(mean average precision) and the F1 measure (min, max and avg). Four types of
graphs can be drawn: points for each repeat in the test, average values over the test
repeats, median values over the test repeats, and the test's mean and variation
values. We can also define colors and labels for each graph.

The graph refreshes in real time, because drawing runs in a separate thread.

Figure 5.10 Batch test tab-window (Graph) screenshot

The Query testing tab-window (Figure 5.11) enables query testing and
analysis. Queries are analyzed through four tables and a graph visualization. In the
first table each row represents one query (query id, F1, MAP, recall, precision,
number of relevant documents, number of retrieved documents, and maximum
query-document similarity). The content of the three other tables changes
dynamically depending on which row (query) is selected in the first table. The
second table contains the terms of the selected query, the third table contains the
ids of its relevant documents, and the last table contains the ids and similarities of
the retrieved documents. There are three types of query matching: similarity limit
(retrieves all documents with similarity greater than or equal to the limit), fixed
number n of retrieved documents (retrieves the n documents with the highest
similarity), and proportion limit (the number of retrieved documents equals the
proportion limit times the number of relevant documents). We can draw recall-
precision graphs for each query separately or averaged over all queries. We can
also draw the distributions of relevant documents per query.
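The three query matching conditions described above can be sketched as follows; the function `retrieve` and its parameter names are ours for illustration, not the names used in the implementation:

```python
import numpy as np

def retrieve(similarities, mode, limit=None, n=None, proportion=None, n_relevant=None):
    """Return the ids of retrieved documents, ranked by similarity (descending)."""
    order = np.argsort(-np.asarray(similarities))  # best-first ranking
    if mode == "similarity_limit":
        # all documents with similarity greater than or equal to the limit
        return [i for i in order if similarities[i] >= limit]
    if mode == "fixed_n":
        # the n best documents by similarity
        return list(order[:n])
    if mode == "proportion_limit":
        # retrieved count = proportion limit times the number of relevant documents
        return list(order[:int(round(proportion * n_relevant))])
    raise ValueError("unknown matching mode: " + mode)
```

For example, with similarities (0.9, 0.2, 0.5, 0.7) a similarity limit of 0.5 retrieves documents 0, 3 and 2, in that order.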

Figure 5.11 Query testing tab-window screenshot

6. EXPERIMENTS AND RESULTS

In this section different experimental results are presented. The experiments
compare two dimensionality reduction methods, latent semantic indexing and
concept indexing (with spherical and fuzzy k-means clustering). The two methods
are compared on two criteria.

• Error between the matrix of vector representations of documents in the
original space (bag-of-words representation) and the matrix of documents in
the reduced dimensionality space
• Information retrieval evaluation

The experiments are performed on two standard document collections for
information retrieval, MEDLINE and CRANFIELD. MEDLINE contains 1033 article
abstracts related to medicine and has 30 defined queries; CRANFIELD contains
1400 documents related to aeronautics and has 225 defined queries. Both
collections were obtained from the SMART system ([19]) and converted into matrix
representations.

Furthermore, it is shown that:

• Concept vectors built by the defined clustering algorithms (spherical and
fuzzy k-means) tend to become orthogonal as the algorithms converge
• Information retrieval performance greatly depends on a correctly chosen
dimension of the reduced dimensionality space, which corresponds to the
natural number of clusters

Experiments are also performed for the defined folding-in method for LSI (only on
the MEDLINE document collection). Finally, memory consumption and running time
experiments are performed and their results presented.

6.1. DECOMPOSITION ERROR

The decomposition error is computed as the distance between the document matrix
in the original space and the document matrix in the reduced dimensionality space,
||A - C_k Z||_F. Figure 6.1 shows a graph of the decomposition error against the
number of concept/singular vectors for the MEDLINE document collection. We see
that SVD gives better results than CD regarding the decomposition error, but also
that CD using fuzzy k-means clustering is better than CD using spherical k-means
clustering.

Figure 6.1 Relation between the number of concept/singular vectors and the
decomposition error for MEDLINE
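The two errors can be computed as in the following sketch (the function names are ours; Z is the least-squares solution, and the truncated-SVD error follows directly from the discarded singular values):

```python
import numpy as np

def concept_decomposition_error(A, C):
    """||A - C Z||_F, where the columns of C are concept vectors and
    Z solves the least-squares problem min_Z ||A - C Z||_F."""
    Z, *_ = np.linalg.lstsq(C, A, rcond=None)
    return float(np.linalg.norm(A - C @ Z, "fro"))

def svd_error(A, k):
    """Frobenius error of the best rank-k approximation of A (truncated SVD),
    i.e. the square root of the sum of the squared discarded singular values."""
    s = np.linalg.svd(A, compute_uv=False)
    return float(np.sqrt(np.sum(s[k:] ** 2)))
```

By the Eckart-Young theorem ([29]), the SVD error is a lower bound for the CD error at the same rank, which is consistent with the graphs.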

Figure 6.2 depicts the relation between the decomposition error and the
number of concept/singular vectors for the CRANFIELD document collection. This
graph confirms that SVD decomposes better than CD on the CRANFIELD collection
as well, and also confirms that CD using fuzzy k-means clustering is slightly better
than CD using spherical k-means clustering on this collection.

Figure 6.2 Relation between the number of concept/singular vectors and the
decomposition error for CRANFIELD

6.2. ORTHONORMALITY OF CONCEPT VECTORS

This experiment shows that concept vectors tend towards orthonormality.
Figure 6.3 depicts the average dot product of the concept vectors against the
number of concept vectors for the MEDLINE document collection. As the number of
concept vectors increases, their average dot product tends towards 0, i.e. the
vectors tend towards orthonormality. CD using fuzzy k-means clustering shows a
stronger tendency towards orthonormality than CD using spherical k-means
clustering on the MEDLINE collection.

Figure 6.3 Relation between the number of concept vectors and the average concept
vectors dot product for MEDLINE

Unlike CD on MEDLINE, Figure 6.4 shows that on the CRANFIELD document
collection there are no large differences in the concept vectors' tendency towards
orthonormality between CD using fuzzy and spherical k-means clustering. Still, the
concept vectors tend towards orthonormality, although less strongly than on the
MEDLINE collection. This is probably because MEDLINE has fewer documents than
CRANFIELD (1033 compared to 1398) but almost twice as many index terms (7014
compared to 3763). With a greater number of index terms orthonormality is easier to
reach because the document vectors are very sparse.

Figure 6.4 Relation between the number of concept vectors and the average concept
vectors dot product for CRANFIELD
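One plausible way to compute the average concept vector dot product plotted above is the mean over all distinct pairs of (unit-norm) concept vectors; the function below is an illustrative sketch, not the thesis implementation:

```python
import numpy as np

def avg_pairwise_dot(C):
    """Average dot product over all distinct pairs of columns of C;
    values near 0 mean the (unit-norm) vectors are nearly orthonormal."""
    C = C / np.linalg.norm(C, axis=0, keepdims=True)  # normalize columns
    G = C.T @ C                                       # Gram matrix of dot products
    k = C.shape[1]
    return float((G.sum() - np.trace(G)) / (k * (k - 1)))  # exclude self-products
```

For orthonormal columns the result is exactly 0, and for identical columns it is 1.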

6.3. CI PERFORMANCES

In this section different CI performance characteristics are measured: the number of
iterations, the test duration and the memory consumption against the number of
concept vectors.

Figure 6.5 shows that CI using fuzzy clustering generally needs more iterations
than CI using spherical clustering on the MEDLINE document collection.

Figure 6.5 Relation between the number of concept vectors and the number of
iterations for MEDLINE

Figure 6.6 corroborates that CI using fuzzy clustering generally needs more
iterations than CI using spherical clustering on the CRANFIELD collection as well.

Figure 6.6 Relation between the number of concept vectors and the number of
iterations for CRANFIELD

Tests performed with CI using fuzzy k-means clustering last longer than tests
with CI using spherical clustering, because of the computation of the fuzzy weights.
Figure 6.7 shows the test duration for MEDLINE and Figure 6.8 for CRANFIELD.

Figure 6.7 Relation between the number of concept vectors and the test duration (in
seconds) for MEDLINE

Figure 6.8 Relation between the number of concept vectors and the test duration (in
seconds) for CRANFIELD

Figure 6.9 shows that memory savings are achieved only with fewer than 25
concept vectors (for MEDLINE). We can also see that the memory consumption of
CI grows linearly, and that CI using fuzzy clustering generally needs more memory
than CI using spherical clustering. The fuzzy weight matrix (number of documents
times number of concept vectors) is an additional structure in CI using fuzzy
clustering, which is why fuzzy clustering uses more memory.

Figure 6.9 Relation between number of concept vectors and memory consumption
for MEDLINE
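The extra memory attributed to the fuzzy weight matrix can be estimated directly; the sketch below assumes double-precision entries (8 bytes per entry, an assumption on our part):

```python
def fuzzy_extra_memory_mb(n_docs, k, bytes_per_entry=8):
    """Memory (in MB) of the fuzzy weight matrix (n_docs x k entries),
    which spherical k-means clustering does not need."""
    return n_docs * k * bytes_per_entry / 2 ** 20
```

For MEDLINE with 1033 documents and 50 concept vectors this is roughly 0.4 MB.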

Figure 6.10 shows CI’s memory consumption for CRANFIELD.

Figure 6.10 Relation between number of concept vectors and memory consumption
for CRANFIELD

We can conclude that while running, CI using fuzzy clustering needs more
iterations, more time and more memory than CI using spherical clustering.

6.4. INFORMATION RETRIEVAL EVALUATION

In this section the results of the experiments on information retrieval
evaluation are presented. First, IR is evaluated without folding-in documents into the
document collection (MEDLINE and CRANFIELD); after that, IR evaluation with
folding-in for LSI is performed. The 50 documents best ranked by similarity with the
query are retrieved, and four IR evaluation measures are computed over all queries:
recall, precision, mean average precision (MAP) and the F1 measure.

Subsection 6.4.1 presents the experiments without folding-in documents and
subsection 6.4.2 the experiments with folding-in documents into the document
collection.
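The four measures can be computed per query as in the following sketch (our own helper, shown for one query; MAP is then the mean of the average precision over all queries):

```python
def evaluate_query(ranked, relevant, cutoff=50):
    """Precision, recall, F1 and average precision for one query when the
    `cutoff` best-ranked documents are retrieved."""
    retrieved = ranked[:cutoff]
    relevant = set(relevant)
    hits = sum(1 for d in retrieved if d in relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    # average precision: precision at each rank where a relevant document occurs
    ap, found = 0.0, 0
    for rank, d in enumerate(retrieved, start=1):
        if d in relevant:
            found += 1
            ap += found / rank
    ap /= len(relevant) if relevant else 1
    return precision, recall, f1, ap
```

For example, for the ranking (d1, d2, d3, d4) with relevant documents {d1, d3}, precision is 0.5, recall is 1.0 and the average precision is (1/1 + 2/3)/2 = 5/6.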

6.4.1. WITHOUT FOLDING-IN DOCUMENTS

Figure 6.11 shows the average recall over all queries against the number of
concept/singular vectors for the MEDLINE document collection. Recall for LSI is
lower than recall for CI. Furthermore, regardless of which clustering algorithm is
used in CI, very similar recall is achieved.

Figure 6.11 Relation between the number of concept/singular vectors and the
average recall of all queries for MEDLINE

Figure 6.12 depicts the average recall over all queries against the number of
concept/singular vectors for the CRANFIELD document collection. Unlike on
MEDLINE, the situation is reversed: recall is greater for LSI than for CI. Similarly to
the MEDLINE collection, the spherical and fuzzy k-means clustering algorithms give
similar recall.

Figure 6.12 Relation between the number of concept/singular vectors and the
average recall of all queries for CRANFIELD

Moreover, the relation between precision and the number of concept/singular
vectors is shown in Figure 6.13 for MEDLINE and in Figure 6.14 for CRANFIELD.

In Figure 6.13 we see that the precision for LSI is greater than for CI. So on
MEDLINE, LSI has greater precision than CI but lower recall. CI with spherical
k-means gives slightly better precision than CI with fuzzy k-means clustering.

Figure 6.13 Relation between the number of concept/singular vectors and the
average precision of all queries for MEDLINE

On the CRANFIELD collection we see that both precision and recall are
greater for LSI than for CI. There is almost no difference between CI with spherical
and with fuzzy k-means clustering.

Figure 6.14 Relation between the number of concept/singular vectors and the
average precision of all queries for CRANFIELD

So far we have presented recall and precision, but these measures alone are
not adequate for information retrieval evaluation. More adequate measures are MAP
and F1, so these are presented in the following experimental results.

Figure 6.15 shows query-document matching in the original space and in the
reduced dimensionality spaces (LSI and CI). Generally LSI gives greater MAP than
CI, but both LSI and CI (with spherical and fuzzy k-means clustering) reach their
maximum MAP at 50 concept vectors. We can assume that the MEDLINE document
collection naturally divides into 50 clusters. Furthermore, CI using fuzzy k-means
clustering gives greater MAP than CI using spherical clustering at the natural
number of clusters (50), but for more clusters it is vice versa. The graph also shows
that better results are achieved in the reduced space than in the original space, even
though information is discarded by the reduction. This is because LSI and CI
address the problems of synonymy and polysemy, redundancy and noise.

Figure 6.15 Relation between the number of concept/singular vectors and the
average MAP of all queries for MEDLINE

Unlike for MEDLINE, the results on CRANFIELD (see Figure 6.16) are
different. Better MAP is achieved in the original space than in the reduced
dimensionality space. Also, the LSI MAP curve grows monotonically; in other words,
it does not reach a maximum. The CI MAP curves start to saturate at 100 concept
vectors. The MAP curve for spherical CI almost completely matches the MAP curve
for fuzzy CI.

Figure 6.16 Relation between the number of concept/singular vectors and the
average MAP of all queries for CRANFIELD

Figure 6.17 shows much better results for LSI than for CI regarding the F1
measure. We can also see that LSI gives better results than query-document
matching in the original space. Furthermore, CI using spherical k-means gives a
slightly better F1 measure than CI using fuzzy clustering.

Figure 6.17 Relation between the number of concept/singular vectors and the
average F1 of all queries for MEDLINE

On CRANFIELD the CI method also gives lower F1 values than LSI (see
Figure 6.18). The F1 value of query-document matching in the original space is
reached by the LSI method at 100 singular vectors.

Figure 6.18 Relation between the number of concept/singular vectors and the
average F1 of all queries for CRANFIELD

Figure 6.19 shows the recall-precision graph for CI (fuzzy clustering) with 50
concept vectors on the MEDLINE document collection.

Figure 6.19 Recall-precision graph for MEDLINE

Furthermore, Figure 6.20 shows the recall-precision graph for CI (fuzzy
clustering) with 100 concept vectors on the CRANFIELD document collection.

Figure 6.20 Recall-precision graph for CRANFIELD

In conclusion, LSI generally gives better IR evaluation results than CI on both
document collections (MEDLINE and CRANFIELD). Better IR evaluation results are
also achieved on MEDLINE than on CRANFIELD. Furthermore, the differences
between CI using spherical and fuzzy k-means clustering are slight.

6.4.2. WITH FOLDING-IN DOCUMENTS

In this section the folding-in method for LSI is tested, only on the MEDLINE
document collection. Tests start with 10% and finish with 100% of the documents in
the initial set, in steps of 10%.
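Folding a new document into the LSI space without recomputing the SVD projects its term vector d through the truncated factors, d_k = Sigma_k^(-1) U_k^T d; a minimal sketch:

```python
import numpy as np

def lsi_fold_in(d, U_k, s_k):
    """Project a new document term vector d into the k-dimensional LSI space
    using the truncated SVD factors: d_k = Sigma_k^(-1) U_k^T d."""
    return (U_k.T @ d) / s_k
```

A document already in the collection folds in exactly onto its row of V_k, which is a quick sanity check for an implementation.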

Figure 6.21 shows that recall grows with a greater initial documents
percentage. We can also see that maximum recall is already reached when the
initial set contains 80% of all documents from the collection. This is probably
because few or no new terms occur between 80% and 100% of initial documents.

Figure 6.21 Relation between the initial document percentage and recall

Precision graphs (Figure 6.22) have the same characteristics as the recall graphs.

Figure 6.22 Relation between the initial document percentage and precision

Figure 6.23 shows MAP measure.

Figure 6.23 Relation between the initial document percentage and MAP

And finally Figure 6.24 depicts F1 measure.

Figure 6.24 Relation between the initial document percentage and F1

The conclusion of this experiment is that IR evaluation results improve as the
initial document percentage increases. The experiment also shows that most of the
terms are already covered by 80% of the initial documents.

7. RELATED WORKS

Dhillon and Modha in [7] compare concept decomposition (CD), which they
developed, to singular value decomposition (SVD) in terms of the similarity of the
original document matrix and its approximation. However, they do not investigate
the suitability of concept decomposition document representations for text mining
tasks such as information retrieval or text classification.

Dimensionality reduction methods using cluster (class) centroids were
developed in [13], [35], [36], [37] and [38]. In Karypis and Han's work [13] a
dimensionality reduction technique based on class centroid projection is used, but
not based on least mean squares (as is the case with concept decomposition). They
investigate the effectiveness of this technique through information retrieval
evaluation, using both supervised and unsupervised versions of the algorithm.

Park and co-workers in [35] developed an algorithm for dimensionality
reduction based on orthonormal centroids and compare its effectiveness with
concept decompositions for text classification. Ye and co-workers in [38] presented
the IDR/QR algorithm for dimensionality reduction, which uses the QR
decomposition ([23]).

Dimensionality reduction methods can also be implemented using kernel
functions. Park and Park [37] implemented the orthonormal centroid algorithm using
kernel functions. This approach makes it possible to effectively introduce
non-linearity for classification with support vector machines (SVM).

Dhillon and co-workers in [6], motivated by the poor effectiveness of the
spherical k-means clustering algorithm for concept decomposition, present a divisive
information-theoretic feature clustering algorithm whose objective function uses
measures from information theory.

Kogan and co-workers in [14] optimize k-means by combining the batch and
incremental k-means clustering algorithms. They use a distance-like function that
combines the Euclidean distance and relative entropy.

In [9] the fuzzy k-means clustering algorithm is presented, and in [8] CI and
LSI are compared on a simple example.

Folding-in of new documents into the reduced dimensionality space for LSI is
presented in [2], where SVD updating retains the orthonormality of the singular
vectors but SVD folding-in does not. That work also uses the semidiscrete
decomposition for LSI ([25]).

The incremental dimensionality reduction algorithm IDR/QR is presented in
[38]. While folding a new document into the document collection, this algorithm
updates the transformation matrix that maps documents into the reduced
dimensionality space. The method is supervised, and its authors test its
effectiveness for text classification.

8. CONCLUSIONS

In this thesis the LSI and CI dimensionality reduction methods are experimentally
compared.

Our experiments have shown that concept vectors tend towards
orthonormality. Concept vectors are local and have well defined semantics, while
singular vectors are global and cannot be interpreted. Concept vectors are also very
sparse (often more than 85%). Regarding the approximation error, LSI gives the
best approximation of the document matrix in the reduced dimensionality space,
although CI's approximation error is comparable to LSI's.

Information retrieval using the LSI method is more effective at higher levels of
recall than retrieval in the original space, because it addresses the problem of
synonymy, which affects recall. In general, we achieve better information retrieval
results with LSI than with CI.

We have also compared the clustering algorithms (fuzzy and spherical) for CI.
CI using fuzzy k-means clustering gives slightly better IR evaluation results than CI
using spherical clustering. We have also shown that the spherical clustering
algorithm is faster and needs less memory than the fuzzy clustering algorithm.
Better IR evaluation results are achieved on MEDLINE than on CRANFIELD with
both dimensionality reduction methods (CI and LSI).

Regarding folding-in documents into the document collection without
re-computing the transformation matrix (truncated SVD), our experiments show that
the best IR evaluation results are achieved with 80% initial documents. This
indicates that folding-in without re-computation or correction of the transformation
matrix gives solid results only for higher percentages of initial documents.

From the results of our experiments we can conclude that LSI and CI are
comparable with regard to IR.

9. REFERENCES

[1] M. W. Berry, Z. Drmač, E. R. Jessup, Matrices, Vector Spaces and Information
Retrieval, SIAM Review, Vol. 41, No. 2, pp. 335–362, 1999.

[2] M. W. Berry, S. T. Dumais, G. W. O'Brien, Using Linear Algebra for Intelligent
Information Retrieval, SIAM Review, Vol. 37, No. 4, pp. 573–595, 1995.

[3] S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, R. Harshman,
Indexing by latent semantic analysis, Journal of the American Society for
Information Science, Vol. 41, No. 6, pp. 391–407, 1990.

[4] I. S. Dhillon, J. Fan, Y. Guan, Efficient Clustering of Very Large Document
Collections, Data Mining for Scientific and Engineering Applications, Kluwer
Academic Publishers, pp. 357–381, 2001.

[5] I. S. Dhillon, Y. Guan, J. Kogan, Refining Clusters in High Dimensional Text
Data, 2nd SIAM International Conference on Data Mining (Workshop on
Clustering High-Dimensional Data and its Applications), April 2002.

[6] I. S. Dhillon, S. Mallela, R. Kumar, A divisive information-theoretic feature
clustering algorithm for text classification, Journal of Machine Learning
Research: Special Issue on Variable and Feature Selection, Vol. 3, pp.
1265–1287, March 2003.

[7] I. S. Dhillon, D. S. Modha, Concept decompositions for large sparse text data
using clustering, Machine Learning, Vol. 42, No. 1, pp. 143–175, 2001.

[8] J. Dobša, B. Dalbelo Bašić, Comparison of information retrieval techniques:
Latent semantic indexing and concept indexing, Journal of Information and
Organizational Sciences, Vol. 28, No. 1-2, pp. 1–15, 2004.

[9] J. Dobša, B. Dalbelo Bašić, Concept decomposition by fuzzy k-means
algorithm, Proceedings of the IEEE/WIC International Conference on Web
Intelligence, pp. 684–688, Halifax, Canada, 2003.

[10] Gmeans: clustering in ping-pong style,
http://www.cs.utexas.edu/users/yguan/datamining/gmeans.html [19.6.2006].

[11] Intel® Math Kernel Library, http://www.intel.com/cd/software/products/asmo-
na/eng/perflib/mkl/index.htm [19.6.2006].

[12] A. K. Jain, M. N. Murty, P. J. Flynn, Data Clustering: A Review, ACM
Computing Surveys, Vol. 31, No. 3, pp. 264–323, 1999.

[13] G. Karypis, E.-H. S. Han, Concept Indexing: A Fast Dimensionality Reduction
Algorithm with Applications to Document Retrieval and Categorization,
Technical Report TR-00-0016, University of Minnesota, 2000.

[14] J. Kogan, M. Teboulle, C. Nicholas, Data driven similarity measures for
k-means like clustering algorithms, Information Retrieval, Vol. 8, pp. 331–349,
2005.

[15] T. A. Letsche, M. W. Berry, Large-scale Information Retrieval with Latent
Semantic Indexing, Information Sciences - Applications, 1997.

[16] NSF Research Awards Abstracts 1990–2003,
http://www.ics.uci.edu/~kdd/databases/nsfabs/nsfawards.html [19.6.2006].

[17] F. Sebastiani, A Tutorial on Automated Text Categorisation, Proceedings of
ASAI-99, 1st Argentinian Symposium on Artificial Intelligence, Buenos Aires,
AR, pp. 7–35, 1999.

[18] F. Sebastiani, Machine Learning in Automated Text Categorization, ACM
Computing Surveys, Vol. 34, No. 1, pp. 1–47, March 2002.

[19] SMART collections, ftp://ftp.cs.cornell.edu/pub/smart/ [19.6.2006].

[20] Latent Semantic Indexing Web Site, http://www.cs.utk.edu/~lsi/ [19.6.2006].

[21] S. Dominich, Information Retrieval - An Advanced Course, 2005.

[22] T. Hofmann, Introduction to Machine Learning, November 2003.

[23] G. Salton, C. Buckley, Term-weighting approaches in automatic text retrieval,
Information Processing & Management, Vol. 24, No. 5, pp. 513–523, 1988.

[24] G. Salton, M. J. McGill, Introduction to Modern Information Retrieval,
McGraw-Hill, New York, 1983.

[25] T. G. Kolda, D. P. O'Leary, A semidiscrete matrix decomposition for latent
semantic indexing in information retrieval, ACM Transactions on Information
Systems, Vol. 16, pp. 322–346, 1998.

[26] J. C. Bezdek, A convergence theorem for the fuzzy ISODATA clustering
algorithms, IEEE Transactions on Pattern Analysis and Machine Intelligence,
Vol. 2, No. 1, pp. 1–8, 1980.

[27] J. C. Bezdek, R. J. Hathaway, Convergence theory for fuzzy c-means:
Counterexamples and repairs, IEEE Transactions on Systems, Man, and
Cybernetics, Vol. 17, No. 5, pp. 873–877, 1987.

[28] C. L. Lawson, R. J. Hanson, Solving Least Squares Problems, SIAM,
Philadelphia, 1995.

[29] C. Eckart, G. Young, The approximation of one matrix by another of lower
rank, Psychometrika, Vol. 1, pp. 211–218, 1936.

[30] R. Baeza-Yates, B. Ribeiro-Neto, Modern Information Retrieval, ACM Press /
Addison-Wesley, New York, 1999.

[31] M. E. Maron, J. L. Kuhns, On relevance, probabilistic indexing and information
retrieval, Journal of the ACM, Vol. 7, No. 3, pp. 216–244, 1960.

[32] Y. Yang, An evaluation of statistical approaches to text categorization,
Information Retrieval, Vol. 1, No. 1-2, pp. 69–90, 1999.

[33] L. D. Baker, A. K. McCallum, Distributional clustering of words for text
classification, Proceedings of SIGIR-98, 21st ACM International Conference
on Research and Development in Information Retrieval, pp. 96–103,
Melbourne, Australia, 1998.

[34] J. Yen, R. Langari, Fuzzy Logic: Intelligence, Control, and Information,
Prentice Hall, New Jersey, 1999.

[35] H. Kim, P. Howland, H. Park, Dimension reduction in text classification with
support vector machines, Journal of Machine Learning Research, Vol. 6, pp.
37–53, 2005.

[36] H. Park, M. Jeon, J. B. Rosen, Lower dimensional representation of text data
based on centroids and least squares, BIT, Vol. 43, No. 3, pp. 1–22, 2003.

[37] C. Park, H. Park, Nonlinear feature extraction based on centroids and kernel
functions, Pattern Recognition, Vol. 37, No. 4, pp. 801–810, 2004.

[38] J. Ye, Q. Li, H. Xiong, H. Park, R. Janardan, V. Kumar, IDR/QR: An
incremental dimension reduction algorithm via QR decomposition, IEEE
Transactions on Knowledge and Data Engineering, Vol. 17, No. 9, pp.
1208–1222, 2005.

TABLES INDEX

Table 2.1 Documents and their categorization (example 2.1) ................................................ 11
Table 2.2 Document ranking by similarity with Q1 and Q2 (example 2.1)............................. 12
Table 2.3 IDF components of index terms (example 2.2) ........................................................ 13
Table 2.4 Document ranking by similarity with Q1 and Q2 (example 2.2) .............................. 14
Table 2.5 The contingency table for one query ....................................................................... 17

Table 4.1 Coordinates of the terms by SVD and CDFKM ...................................................... 38
Table 4.2 Coordinates of documents and queries by SVD and CDFKM ................................ 38

FIGURES INDEX

Figure 2.1 Document-term matrix ............................................................................................. 8
Figure 2.3 The first document from MEDLINE document collection ..................................... 19
Figure 2.4 Some of the queries for MEDLINE document collection ....................................... 19
Figure 2.5 The first document from CRANFIELD document collection ................................. 20
Figure 2.6 Some of the queries for CRANFIELD document collection ................................. 20

Figure 3.1 Process of data clustering ...................................................................................... 22
Figure 3.2 Taxonomy of clustering approaches ....................................................................... 23

Figure 4.1 Reduced singular value decomposition (truncated SVD) ...................................... 27
Figure 4.2 Images of terms by LSI ........................................................................................... 39
Figure 4.3 Images of terms by CI ............................................................................................ 39
Figure 4.4 Images of documents and queries by LSI ............................................................... 41
Figure 4.5 Images of documents and queries by CI ................................................................ 41

Figure 5.1 Single test tab-window screenshot ......................................................................... 46
Figure 5.2 Documents before clustering (example type-1) ..................................................... 47
Figure 5.3 Documents after clustering (example type-1) ........................................................ 47
Figure 5.4 Document and query representations in CI space (example type-2) ..................... 48
Figure 5.5 Term representations in CI space (example type-2) .............................................. 49
Figure 5.6 Document and query representations in LSI space (example type-2) ................... 50
Figure 5.7 Term representations in LSI space (example type-2) ............................................ 51
Figure 5.8 Batch test tab-window (Options) screenshot ......................................................... 52
Figure 5.9 Batch test tab-window (Test list and results) screenshot ....................................... 53
Figure 5.10 Batch test tab-window (Graph) screenshot.......................................................... 54
Figure 5.11 Query testing tab-window screenshot .................................................................. 55

Figure 6.1 Relation between the number of concept/singular vectors and the decomposition
error for MEDLINE ................................................................................................................. 57
Figure 6.2 Relation between the number of concept/singular vectors and the decomposition
error for CRANFIELD ............................................................................................................. 58
Figure 6.3 Relation between the number of concept vectors and the average concept vectors
dot product for MEDLINE ....................................................................................................... 59
Figure 6.4 Relation between the number of concept vectors and the average concept vectors
dot product for CRANFIELD ................................................................................................... 60
Figure 6.5 Relation between the number of concept vectors and the number of iterations for
MEDLINE................................................................................................................................. 61
Figure 6.6 Relation between the number of concept vectors and the number of iterations for
CRANFIELD ............................................................................................................................ 62
Figure 6.7 Relation between the number of concept vectors and the test duration (in seconds)
for MEDLINE ........................................................................................................................... 63
Figure 6.8 Relation between the number of concept vectors and the test duration (in seconds)
for CRANFIELD ....................................................................................................................... 64
Figure 6.9 Relation between number of concept vectors and memory consumption for
MEDLINE................................................................................................................................. 65
Figure 6.10 Relation between number of concept vectors and memory consumption for
CRANFIELD ............................................................................................................................ 66
Figure 6.11 Relation between the number of concept/singular vectors and the average recall
of all queries for MEDLINE ..................................................................................................... 68
Figure 6.12 Relation between the number of concept/singular vectors and the average recall
of all queries for CRANFIELD................................................................................................. 69
Figure 6.13 Relation between the number of concept/singular vectors and the average
precision of all queries for MEDLINE ..................................................................................... 70
Figure 6.14 Relation between the number of concept/singular vectors and the average
precision of all queries for CRANFIELD ................................................................................. 71
Figure 6.15 Relation between the number of concept/singular vectors and the average MAP
of all queries for MEDLINE ..................................................................................................... 72
Figure 6.16 Relation between the number of concept/singular vectors and the average MAP
of all queries for CRANFIELD................................................................................................. 73
Figure 6.17 Relation between the number of concept/singular vectors and the average F1 of
all queries for MEDLINE ......................................................................................................... 74
Figure 6.18 Relation between the number of concept/singular vectors and the average F1 of
all queries for CRANFIELD ..................................................................................................... 75
Figure 6.19 Recall-precision graph for MEDLINE................................................................. 76
Figure 6.20 Recall-precision graph for CRANFIELD ............................................................ 77
Figure 6.21 Relation between the initial document percentage and recall ............................. 78
Figure 6.22 Relation between the initial document percentage and precision ....................... 79
Figure 6.23 Relation between the initial document percentage and MAP .............................. 80
Figure 6.24 Relation between the initial document percentage and F1 .................................. 81