Goran Jovanov
CONTENT
1. INTRODUCTION
3. CLUSTERING
8. CONCLUSIONS
9. REFERENCES
1. INTRODUCTION
retrieve all of the relevant documents and as few irrelevant documents as possible. Unfortunately, due to problems such as polysemy (words with multiple meanings) and synonymy (different words that have the same meaning), the list of documents retrieved for a given query is almost never perfect, and the user has to ignore some of the items.
Although there are many other models (see the Information retrieval chapter), the algorithms we are dealing with are embedded in the vector space model. Documents are represented as vectors of term frequencies, and the document set is represented as a matrix of document vectors. One of the main problems in information retrieval with the vector space model is the high dimensionality of the document-term matrix: the number of documents in a collection may vary from a few thousand to several hundred thousand, and the number of terms is often more than a few thousand. Hence, dimensionality reduction is very useful.
There are many methods for dimensionality reduction, but the most widely used are Latent Semantic Indexing (LSI) (see [1], [2], [3], [15], [20], [25]), which is based on Singular Value Decomposition (SVD), and Concept Indexing (CI) (see [4], [5], [6], [7], [9]), which is based on Concept Decomposition (CD) computed with k-means clustering algorithms (see the Clustering chapter). The comparison of these two methods ([8]) is the main objective of this thesis.
2. INFORMATION RETRIEVAL
The machine learning approach ([18], [22]) to classifier construction heavily relies on
the basic machinery of information retrieval. The reason is that both information
retrieval and document categorization are content-based document management
tasks, and therefore share many characteristics. Information retrieval techniques are used in the following phases of the classification task:
(1) lexical text processing (eliminating punctuation marks and numbers, ignoring case)
(2) eliminating non-content-bearing words such as conjunctions, prepositions and similar words which generally have low semantic value in text exploration (so-called stop words)
(3) reducing words to their basic form by stemming or lemmatization
(4) index term selection, e.g. preferring nouns and eliminating other forms of those words
(5) construction and usage of a thesaurus, a glossary of associative synonym term sets
In the following section different models of information retrieval are presented and discussed, with regard to the initially postulated presumption. An information retrieval model can be precisely defined in the following way:
Definition 2.1 [30] An information retrieval model is an ordered quadruple (Dr, Qr, F, g(q, a)) where:
- Dr is a representation of a document set,
- Qr is a representation of a query set,
- F is a set of rules for modeling the representation of documents and queries and the relationship between them,
- g(q, a) : Qr × Dr → R is a real function which defines the ordering of documents by relevance to a given query, also called the decision function.
Three classical models in the information retrieval discipline are the probabilistic, the logic and the vector space model. In the probabilistic model the set of rules for modeling document and query representations is based on probability theory. In the logic model, documents and queries are represented by sets of index terms, so this model is based on set theory. Finally, in the vector space model documents and queries are represented as multidimensional vectors, so this model is based on linear algebra. The algorithms used in this thesis are embedded in the vector space model.
2.1. VECTOR SPACE MODEL
Today, the vector space model is the most popular model in the information retrieval discipline. Unlike in other models, index terms in documents and queries are assigned real-valued weights, and the similarity measure takes values from the interval [0, 1].
Let T = {t1, t2, ..., tm} denote the set of index terms and D = {d1, d2, ..., dn} the set of documents from the document collection. Furthermore, let aij denote the weight assigned to the pair (ti, dj), for i = 1, 2, ..., m and j = 1, 2, ..., n. The weight values are real and positive. Index terms in the representation of a query q are also assigned weights: let qi, for i = 1, 2, ..., m, denote the weight assigned to the pair (ti, q).
Definition 2.2 [31] In the vector space model documents dj, j = 1, 2, ..., n are represented in vector form as aj = (a1j, a2j, ..., amj)^T, and queries as q = (q1, q2, ..., qm)^T. The document collection D is represented in document-term matrix form A = [aij] = [a1 a2 ... an] (see Figure 2.1).
A = [a_{ij}] = \begin{pmatrix} a_{11} & \cdots & a_{1n} \\ \vdots & \ddots & \vdots \\ a_{m1} & \cdots & a_{mn} \end{pmatrix}

with rows corresponding to index terms t_1, ..., t_m and columns to document vectors a_1, ..., a_n.
Figure 2.1 Document-term matrix
Definition 2.3 [31] The similarity measure between document dj and query q is defined as the cosine of the angle between their vector representations:

sim(d_j, q) = \cos(\angle(a_j, q)) = \frac{a_j^T q}{\|a_j\| \, \|q\|} = \frac{\sum_{i=1}^{m} a_{ij} q_i}{\sqrt{\sum_{i=1}^{m} a_{ij}^2} \, \sqrt{\sum_{i=1}^{m} q_i^2}}    (2.1)

where ||aj|| and ||q|| are the Euclidean norms of the vector representations of the document and the query.
Since the aij and qi values are positive, the similarity measure takes values from the interval [0, 1]. Similarity values close to 1 mean a better match between document dj from the collection and query q. In practice a similarity threshold t is often defined, so that the documents retrieved for a given query q are those dj with similarity values in the interval [t, 1]. Another approach to filtering and ranking retrieved documents is to sort the documents by decreasing similarity value and to retrieve only the k documents with the best similarity values.
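The cosine ranking of (2.1) with a top-k cutoff can be sketched as follows; the three toy document vectors and their term weights are invented for illustration:

```python
import math

def cosine(a, q):
    """Cosine of the angle between two term-weight vectors (eq. 2.1)."""
    dot = sum(x * y for x, y in zip(a, q))
    na = math.sqrt(sum(x * x for x in a))
    nq = math.sqrt(sum(y * y for y in q))
    return dot / (na * nq) if na and nq else 0.0

def retrieve_top_k(docs, q, k):
    """Rank documents by cosine similarity to q and keep the k best."""
    ranked = sorted(docs.items(), key=lambda kv: cosine(kv[1], q), reverse=True)
    return [label for label, _ in ranked[:k]]

# Toy document-term weight vectors (three documents, three terms).
docs = {
    "D1": [1.0, 0.0, 2.0],
    "D2": [0.0, 1.0, 0.0],
    "D3": [1.0, 1.0, 1.0],
}
query = [1.0, 0.0, 1.0]
print(retrieve_top_k(docs, query, k=2))  # → ['D1', 'D3']
```

A similarity threshold t could be applied instead of the cutoff by filtering `cosine(v, q) >= t` before ranking.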
The above preprocessing scheme yields the number of occurrences of word j
in document i, say, fji, and the number of documents which contain the word j, say, dj.
Using these counts, we now create n document vectors in Rd, namely, a1, a2,..., an as
follows. For 1 ≤ j ≤ d, set the j-th component of document vector xi, 1 ≤ i ≤ n, to be the
product of three terms
x_{ji} = t_{ji} \cdot g_j \cdot s_i    (2.2)
where tji is the term weighting component and depends only on fji, gj is the global
weighting component and depends on dj, and si is the normalization component for
ai. Intuitively, tji captures the relative importance of a word in a document, while gj
captures the overall importance of a word in the entire set of documents. The
objective of such weighting schemes is to enhance discrimination between various
document vectors and to enhance retrieval effectiveness.
There are many schemes for selecting the term, global, and normalization components; for example, ([4], [7], [17], [18]) present 5, 5, and 2 schemes, respectively, for the term, global, and normalization components, a total of 5 × 5 × 2 = 50 choices. From this extensive set, we will use two popular schemes denoted as txn and tfn, known respectively as normalized term frequency and normalized term frequency-inverse document frequency. Both schemes emphasize words with higher frequencies and use tji = fji. The txn scheme uses gj = 1, while the tfn scheme emphasizes words with low overall collection frequency and uses formula 2.3 (dj is the number of documents in which the index term occurs; n is the number of documents in the collection). In both schemes, each document vector is normalized to have unit L2 norm via the normalization component (2.4):

g_j = \log \frac{n}{d_j}    (2.3)

s_i = \left( \sum_{j=1}^{d} (t_{ji} \, g_j)^2 \right)^{-1/2}    (2.4)
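The tfn scheme (t_ji = f_ji, g_j = log(n/d_j), unit L2 normalization) can be sketched as follows; the small frequency matrix is invented for illustration:

```python
import math

def tfn_vectors(freqs):
    """freqs[i][j] = occurrences of word j in document i (t_ji = f_ji).
    Returns tf-idf weighted, L2-normalized document vectors (eqs. 2.2-2.4)."""
    n = len(freqs)        # number of documents
    d = len(freqs[0])     # number of index terms
    # d_j = number of documents containing term j; g_j = log(n / d_j)
    df = [sum(1 for i in range(n) if freqs[i][j] > 0) for j in range(d)]
    g = [math.log(n / df[j]) if df[j] else 0.0 for j in range(d)]
    vecs = []
    for row in freqs:
        w = [f * gj for f, gj in zip(row, g)]          # t_ji * g_j
        norm = math.sqrt(sum(x * x for x in w))        # 1 / s_i
        vecs.append([x / norm if norm else 0.0 for x in w])
    return vecs

# Toy frequency matrix: 3 documents, 4 terms (counts invented).
F = [[2, 0, 1, 0],
     [0, 1, 1, 0],
     [1, 1, 0, 1]]
V = tfn_vectors(F)
print([round(sum(x * x for x in v), 6) for v in V])  # each vector has unit norm
```

The txn scheme is the same sketch with `g = [1.0] * d`.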
2.2. VECTOR SPACE MODEL EXAMPLES
Example 2.1 [8] The document collection in this example is composed of 15 book and article titles divided into two clusters. The first cluster is composed of 9 data mining (DM) documents, the second cluster contains 5 documents related to linear algebra (LA), and document D6 (matrices, vector spaces, and information retrieval) is a combination of both disciplines (data mining and linear algebra).
As a result of index term extraction we get the following index term list: 1) text, 2) mining, 3) clustering, 4) classification, 5) retrieval, 6) analysis, 7) information, 8) linear, 9) algebra, 10) matrix, 11) application, 12) document, 13) vector, 14) space, 15) data and 16) algorithm. Documents and their categorization are shown in Table 2.1. Two queries shall be presented in order to illustrate the information retrieval process.
Relevant documents for query Q1 are the DM documents and the documents categorized in both categories, whereas only document D6 is relevant for query Q2. Most of the relevant documents for Q1 do not contain the index terms contained in Q1 (but those documents contain index terms such as clustering, classification, information and retrieval, which are relevant for the DM discipline). D6, which is relevant for Q2, does not contain any index term contained in Q2.
Label Category Document
q1 = (0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0)T,
q2 = (0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0)T.
Before computing similarities between queries Q1 and Q2 and documents Di (1 ≤ i ≤ 15) from the document collection, the document and query vectors are transformed into txn form. Because the document and query vectors are now unit vectors, the inner product between vectors is used as the similarity measure:

\cos(\angle(a, b)) = \frac{a^T b}{\|a\| \, \|b\|} = a^T b    (2.5)
Query Q1 Query Q2
Document Inner product Document Inner product
From this example we see an evident disadvantage of the vector space model: the only documents retrieved are those that contain index terms contained in the queries (lexical matching). Documents that are related by semantics, but do not contain index terms from the query (e.g. synonyms), will not be retrieved.
Example 2.2 [8] This example illustrates the application of global weighting, the IDF component. The same document collection is used, with the same queries Q1 and Q2, where the document collection uses the tfn weighting function and the queries use tfx. The TF component represents the index term occurrences in a given document. Furthermore, IDF is computed by formula 2.3 and the normalization component by formula 2.4. IDF components of the index terms are shown in Table 2.3.
Term    dj (number of documents in which term occurs)    IDF component
text 4 1.3218
mining 3 1.6094
clustering 4 1.3218
classification 2 2.0149
retrieval 4 1.3218
analysis 3 1.6094
information 3 1.6094
linear 2 2.0149
algebra 5 1.0986
matrix 4 1.3218
application 2 2.0149
document 2 2.0149
vector 3 1.6094
space 3 1.6094
data 4 1.3218
algorithm 2 2.0149
The final document-term matrix A is obtained by multiplying the rows of the frequency matrix F by the corresponding IDF components and normalizing the columns. The query vectors q1 and q2 from example 2.1 are multiplied by the corresponding IDF components, giving the vector representations of the queries.
The documents retrieved for queries q1 and q2, ranked by relevance, are shown in Table 2.4.
Query Q1 Query Q2
Document Inner product Document Inner product
In this example we notice a slightly different ranking of the retrieved documents than in example 2.1. Nevertheless, the same documents have nonzero similarity, and the few top documents have the same ranking order as in example 2.1.
2.3. EVALUATION
Classification effectiveness is measured in terms of the classic IR notions of precision p and recall r. In order to define these measures precisely, we denote by A the set of documents retrieved for a given query, by R the set of relevant documents, and by Rα the intersection of the two sets (Rα = A ∩ R). By |A|, |R| and |Rα| we denote the cardinalities of those sets. Recall r and precision p are computed as follows, both in set form and in terms of per-category counts:

r = \frac{|R_\alpha|}{|R|}    (2.6)        r = \frac{\sum_{i=1}^{n} TP_i}{\sum_{i=1}^{n} (TP_i + FN_i)}    (2.7)

p = \frac{|R_\alpha|}{|A|}    (2.8)        p = \frac{\sum_{i=1}^{n} TP_i}{\sum_{i=1}^{n} (TP_i + FP_i)}    (2.9)

TP (true positive), FP (false positive), TN (true negative) and FN (false negative) are described in Table 2.5.
                                Relevant document
Documents di, 1 ≤ i ≤ n         YES      NO
Retrieved: YES                  TP       FP
Retrieved: NO                   FN       TN
Generally, as recall grows, precision decreases and vice versa. These measures are closely related to each other and are computed at the same time. Besides that, the set A is never presented to the user at once; rather, a ranked document list ordered by decreasing similarity is retrieved. As the user examines this list in top-down order, precision and recall vary. Insight into information retrieval quality can be acquired by averaging precision measurements at different recall levels. Usually average precision is computed at the 11 standard recall levels 0%, 10%, 20%, ..., 100%. Let rk, k = 0, 1, ..., 10 denote the k-th standard recall level. When the user iterates through the list of documents retrieved for a given query, recall usually does not fall exactly on the standard levels, so the precision P(ri) at each level is interpolated. The average precision is then

\bar{P} = \frac{1}{11} \sum_{i=0}^{10} P(r_i)    (2.11)
Effectiveness can also be measured as the value of the Fα function, for some 0 ≤ α ≤ 1, where Fα is defined as follows:

F_\alpha = \frac{1}{\alpha \frac{1}{p} + (1 - \alpha) \frac{1}{r}}    (2.12)

In this formula α may be seen as the relative degree of importance attributed to p and r: if α = 1, then Fα coincides with p; if α = 0, then Fα coincides with r. Usually a value of α = 0.5 is used, which attributes equal importance to p and r; for reasons we do not want to enter into here, rather than F0.5 this is usually called F1 (see [17], [18] for details). As shown in [32], for a given classifier Φ, its breakeven value is always less than or equal to its F1 value.
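The set-based measures (2.6), (2.8) and (2.12) can be computed directly; the document labels below are invented for the example:

```python
def precision_recall(retrieved, relevant):
    """p = |A ∩ R| / |A|, r = |A ∩ R| / |R| (eqs. 2.6, 2.8)."""
    hits = len(set(retrieved) & set(relevant))
    p = hits / len(retrieved) if retrieved else 0.0
    r = hits / len(relevant) if relevant else 0.0
    return p, r

def f_alpha(p, r, alpha=0.5):
    """F_alpha = 1 / (alpha/p + (1 - alpha)/r) (eq. 2.12); alpha = 0.5 gives F1."""
    if p == 0.0 or r == 0.0:
        return 0.0
    return 1.0 / (alpha / p + (1.0 - alpha) / r)

retrieved = ["D1", "D3", "D5", "D7"]   # A, the retrieved set
relevant  = ["D1", "D2", "D3"]         # R, the relevant set; R_alpha = {D1, D3}
p, r = precision_recall(retrieved, relevant)
print(p, r, f_alpha(p, r))  # → 0.5, 2/3, and their harmonic mean
```

With α = 0.5 the formula reduces to the harmonic mean 2pr/(p + r), which is the usual F1.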
2.4. EXAMPLE OF DOCUMENTS AND QUERIES
Example 2.3 This example shows documents and queries from the MEDLINE and CRANFIELD collections. The label .I marks a new document, and the number beside this label is the ordinal number of the document. Furthermore, the label .W marks the beginning of the document text. In Figure 2.3 we see that the document is relatively short and contains very specific terms such as fetal, plasma, glucose, fatty, acids. We can also see in Figure 2.4 that the queries are very short and clear, and contain specific terms. The fifth query contains the term fetus, while the first document, which is relevant for the fifth query, contains the term fetal (lexically different from the term fetus, but with the same semantics).
.I 1
.W
correlation between maternal and fetal plasma levels of glucose and free fatty
acids .
correlation coefficients have been determined between the levels of
glucose and ffa in maternal and fetal plasma collected at delivery .
significant correlations were obtained between the maternal and fetal
glucose levels and the maternal and fetal ffa levels . from the size of
the correlation coefficients and the slopes of regression lines it
appears that the fetal plasma glucose level at delivery is very strongly
dependent upon the maternal level whereas the fetal ffa level at
delivery is only slightly dependent upon the maternal level .
.I 1
.W
the crystalline lens in vertebrates, including humans.
.I 3
.W
electron microscopy of lung or bronchi.
.I 4
.W
tissue culture of lung or bronchial neoplasms.
.I 5
.W
the crossing of fatty acids through the placental barrier. normal
fatty acid levels in placenta and fetus.
.I 10
.W
neoplasm immunology.
In Figure 2.5 we see the first document and in Figure 2.6 some of the queries from the CRANFIELD document collection. The label .T marks the title and the label .A the author of the document. Unlike MEDLINE, we can see that CRANFIELD documents and queries contain less specific terms.
.I 1
.T
experimental investigation of the aerodynamics of a
wing in a slipstream .
.A
brenckman,m.
.W
experimental investigation of the aerodynamics of a
wing in a slipstream .
an experimental study of a wing in a propeller slipstream was
made in order to determine the spanwise distribution of the lift
increase due to slipstream at different angles of attack of the wing
and at different free stream to slipstream velocity ratios . the
results were intended in part as an evaluation basis for different
theoretical treatments of this problem .
the comparative span loading curves, together with
supporting evidence, showed that a substantial part of the lift increment
produced by the slipstream was due to a /destalling/ or
boundary-layer-control effect . the integrated remaining lift
increment, after subtracting this destalling lift, was found to agree
well with a potential flow theory .
an empirical evaluation of the destalling effects was made for
the specific configuration of the experiment .
.I 001
.W
what similarity laws must be obeyed when constructing aeroelastic models
of heated high speed aircraft .
.I 002
.W
what are the structural and aeroelastic problems associated with flight
of high speed aircraft .
.I 004
.W
what problems of heat conduction in composite slabs have been solved so
far .
.I 008
.W
can a criterion be developed to show empirically the validity of flow
solutions for chemically reacting gas mixtures based on the simplifying
assumption of instantaneous local chemical equilibrium .
3. CLUSTERING
Clustering is the task of organizing a set of objects into meaningful groups. These groups can be disjoint, overlapping, or organized in some hierarchical fashion. The key element of clustering is the notion that the discovered groups are meaningful. This definition is intentionally vague, as what constitutes meaningful is, to a large extent, application dependent. In some applications this may translate to groups in which the pairwise similarity between their objects is maximized, and the pairwise similarity between objects of different groups is minimized. In other applications this may translate to groups that contain objects sharing some key characteristics, although their overall similarity is not the highest.
Figure 3.1 depicts a typical sequencing of the first three of these steps, including
a feedback path where the grouping process output could affect subsequent feature
extraction and similarity computations.
Figure 3.1 Process of data clustering
The grouping step can be performed in a number of ways. The output cluster
(or clusters) can be hard (a partition of the data into groups) or fuzzy (where each
pattern has a variable degree of membership in each of the output clusters).
Hierarchical clustering algorithms produce a nested series of partitions based on a
criterion for merging or splitting clusters based on similarity. Partitional clustering
algorithms identify the partition that optimizes (usually locally) a clustering criterion.
Additional techniques for the grouping operation include probabilistic and graph-
theoretic clustering methods.
Data abstraction is the process of extracting a simple and compact
representation of a data set. Here, simplicity is either from the perspective of
automatic analysis (so that a machine can perform further processing efficiently) or it
is human-oriented (so that the representation obtained is easy to comprehend and
intuitively appealing). In the clustering context, a typical data abstraction is a compact
description of each cluster, usually in terms of cluster prototypes or representative
patterns such as the centroid.
Different approaches to clustering data can be described with the help of the
hierarchy shown in Figure 3.2. At the top level, there is a distinction between
hierarchical and partitional approaches (hierarchical methods produce a nested
series of partitions, while partitional methods produce only one).
4. DIMENSIONALITY REDUCTION
Different techniques for dimensionality reduction in vector space model have been
developed. There are many motives for dimensionality reduction such as: memory
space reduction for document representation, better information retrieval or
classification performance, noise and redundancy elimination in document
representation etc. Although dimensionality reduction means information reduction,
often results with more efficiency in information retrieval and classification, which
shall be confirmed later in this thesis. If dimensionality of the original vector space is
equal to index term count, dimensionality reduction can be performed in the following
two manners:
This thesis is based on the term extraction information retrieval methods of latent semantic indexing and concept indexing. The LSI method was introduced in 1990 [33] and improved in 1995 [30]. It represents documents as approximations and tends to cluster documents on similar topics even if their term profiles are somewhat different. This approximate representation is accomplished by using a low-rank singular value decomposition (SVD) approximation of the term-document matrix. Although the LSI method has had empirical success, it suffers from a lack of interpretation of the low-rank approximation and, consequently, a lack of controls for accomplishing specific tasks in information retrieval. An explanation of latent semantic indexing's efficiency in terms of multivariate analysis is provided in [3], [15], [16]. A method by Dhillon and Modha [7] uses the centroids of clusters created by the spherical k-means algorithm, the so-called concept decomposition (CD), for lowering the rank of the term-document matrix. With this method, the space onto which the term-document matrix is projected is more interpretable: namely, it is the space spanned by the centroids of the clusters. The information retrieval technique using concept decomposition is called concept indexing (CI). Furthermore, the concept decomposition method is computationally more efficient and requires less memory than LSI.
4.1. LATENT SEMANTIC INDEXING
Let the m × n matrix A = [aij] be the term-document matrix. Then aij is the weight of
the i-th term in the j-th document. The standard procedure is to normalize the
columns of the matrix to be of unit norm. The term-document matrix has an important
property of being sparse, i.e. most of its elements are zeros.
A query has the same form as a document; it is a vector, which on the i-th place
has the frequency of the i-th term in the query. We never normalize the vector of the
query because it has no effect on document ranking. A common measure of similarity
between the query and the document is the cosine of the angle between them.
A = U \Sigma V^T    (4.1)

where p = min{m, n} and σ1 ≥ σ2 ≥ ... ≥ σp ≥ 0. The σi are the singular values, and ui and vi are the i-th left singular vector and the i-th right singular vector, respectively.
The second fundamental result [29] is the theorem by Eckart and Young, which
states that the distance in the Frobenius norm between A and its k-rank
approximation is minimized by the approximation Ak. Here
A_k = U_k \Sigma_k V_k^T    (4.3)
where Uk is the m × k matrix whose columns are the first k columns of U, Vk is the n × k matrix whose columns are the first k columns of V, and Σk is the k × k diagonal matrix whose diagonal elements are the k largest singular values of A. More precisely,
[Figure: block structure of the truncated SVD, A (m × n) ≈ Uk (m × k) · Σk (k × k) · VkT (k × n); in the full decomposition A (m × n) = U (m × m) · Σ (m × n) · VT (n × n), the columns of U are term vectors and the columns of V are document vectors.]
The ranking of documents according to their relevance for a query using the LSI method is performed by calculating the score vector

s = q^T U_k \Sigma_k V_k^T    (4.5)
LSI Algorithm:
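As a minimal sketch of the LSI retrieval steps just described (truncate the SVD at rank k, then score the query as in (4.5)); the small term-document matrix and query are invented for illustration:

```python
import numpy as np

def lsi_scores(A, q, k):
    """Rank-k LSI: A ≈ U_k Σ_k V_k^T, score vector s = q^T U_k Σ_k V_k^T (eq. 4.5)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    Uk, Sk, Vtk = U[:, :k], np.diag(s[:k]), Vt[:k, :]
    return q @ Uk @ Sk @ Vtk          # one score per document

# Toy term-document matrix (4 terms x 3 documents, weights invented).
A = np.array([[1.0, 0.0, 0.0],
              [1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0],
              [0.0, 0.0, 1.0]])
q = np.array([1.0, 1.0, 0.0, 0.0])    # query using the first two terms
print(lsi_scores(A, q, k=2).round(3))
```

The first document, which contains both query terms, receives the highest score under the rank-2 approximation.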
4.2. CONCEPT INDEXING
Cluster centroids are also called concept vectors. Concept vectors represent concepts derived from clustering on the index terms, and can be used as a model for classifying documents later folded into the document collection. One of the main advantages of CI over LSI is the interpretability of the concept vectors: they are local, unlike LSI's singular vectors, which are not interpretable and are global. Furthermore, CI is less complex and uses less memory than LSI [7]. On the other hand, while LSI has a solid theoretical basis, CI has no comparable theoretical baseline.
If we assume linear independence of the concept vectors, then it follows that the concept matrix has rank k. Now we define the concept decomposition Dk of the document-term matrix A as the least-squares approximation of A on the column space of the concept matrix Ck. The concept decomposition is the m × n matrix

D_k = C_k Z^*    (4.7)

Z^* = \arg\min_{Z} \| A - C_k Z \|    (4.8)

that is,

Z^* = (C_k^T C_k)^{-1} C_k^T A    (4.9)
In this thesis two types of CI algorithms are used: one is spherical k-means and the other is fuzzy k-means. In the following subsections both types are described.
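The least-squares solution (4.9) can be sketched with a normal-equations solve; the random matrices below stand in for a real term-document matrix and concept matrix:

```python
import numpy as np

def concept_decomposition(A, C):
    """Z* = (C^T C)^{-1} C^T A (eq. 4.9); D_k = C Z* approximates A (eq. 4.7)."""
    Z = np.linalg.solve(C.T @ C, C.T @ A)   # solve the normal equations
    return C @ Z, Z

# Toy data: 4 terms x 5 documents, and k = 2 concept vectors (invented).
rng = np.random.default_rng(0)
A = rng.random((4, 5))
C = rng.random((4, 2))
Dk, Z = concept_decomposition(A, C)
print(np.linalg.norm(A - Dk))   # residual of the rank-2 approximation
```

By least-squares optimality, the residual A − Dk is orthogonal to the column space of C.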
4.2.1. SPHERICAL K-MEANS
Suppose we are given n document vectors a1, a2, ..., an in R^m with nonnegative components. Let π1, π2, ..., πk denote a partitioning of the document vectors into k disjoint clusters such that ([7])

\pi_1 \cup \pi_2 \cup \cdots \cup \pi_k = \{a_1, a_2, \ldots, a_n\} \quad \text{and} \quad \pi_j \cap \pi_l = \emptyset \ \text{if} \ j \neq l.    (4.10)
For each fixed 1 ≤ j ≤ k, the mean vector or the centroid of the document vectors
contained in the cluster πj is
m_j = \frac{1}{n_j} \sum_{a \in \pi_j} a    (4.11)
where nj is the number of document vectors in πj . Note that the mean vector mj need
not have a unit norm; we can capture its direction by writing the corresponding
concept vector as
c_j = \frac{m_j}{\|m_j\|}    (4.12)
The concept vector cj has the following important property. For any unit vector z in R^m, we have from the Cauchy-Schwarz inequality that

\sum_{a \in \pi_j} a^T z \leq \sum_{a \in \pi_j} a^T c_j.    (4.13)
Thus, the concept vector may be thought of as the vector that is closest in cosine similarity (in an average sense) to all the document vectors in the cluster πj. Motivated by (4.13), we measure the "coherence" or "quality" of each cluster πj, 1 ≤ j ≤ k, as

\sum_{a \in \pi_j} a^T c_j    (4.14)
Observe that if all document vectors in a cluster are identical, then the average coherence of that cluster will have the highest possible value of 1. On the other hand, if the document vectors in a cluster vary widely, then the average coherence will be small, that is, close to 0. Since \sum_{a \in \pi_j} a = n_j m_j and ||cj|| = 1, we have that

\sum_{a \in \pi_j} a^T c_j = n_j \, m_j^T c_j = n_j \|m_j\| = \left\| \sum_{a \in \pi_j} a \right\|    (4.15)
This rewriting yields the remarkably simple intuition that the quality of each cluster πj is measured by the L2 norm of the sum of the document vectors in that cluster. We measure the quality of any given partitioning {πj}_{j=1}^{k} using the following objective function:

Q\left( \{\pi_j\}_{j=1}^{k} \right) = \sum_{j=1}^{k} \sum_{a \in \pi_j} a^T c_j    (4.16)
Intuitively, the objective function measures the combined coherence of all the
k clusters. Such an objective function has also been proposed and studied
theoretically in the context of market segmentation problems (Kleinberg et al., 1998).
Spherical k-means algorithm:
1) Initialize clustering. Start with some initial partitioning of the document vectors, namely {π_j^{(0)}}_{j=1}^{k}, and let {c_j^{(0)}}_{j=1}^{k} be the concept vectors of the associated clusters.
2) For each document vector ai, 1 ≤ i ≤ n, find the concept vector closest in cosine similarity to ai, and compute the new partitioning {π_j^{(t+1)}}_{j=1}^{k} induced by the old concept vectors. In words, π_j^{(t+1)} is the set of all document vectors that are closest to the concept vector c_j^{(t)}. If some document vector is simultaneously closest to more than one concept vector, it is randomly assigned to one of those clusters. The clusters so defined are known as Voronoi or Dirichlet partitions. The new concept vectors are

c_j^{(t+1)} = \frac{m_j^{(t+1)}}{\| m_j^{(t+1)} \|}, \quad 1 \leq j \leq k    (4.18)

where m_j^{(t+1)} denotes the centroid or the mean of the document vectors in cluster π_j^{(t+1)}.
3) Compare the objective function values of the successive partitionings,

Q\left( \{\pi_j^{(t)}\}_{j=1}^{k} \right) \ \text{and} \ Q\left( \{\pi_j^{(t+1)}\}_{j=1}^{k} \right),    (4.19)

and if the change is sufficiently small, stop; otherwise increment t and go to step 2.
One can either (a) randomly assign each document to one of the k clusters, (b) first compute the concept vector for the entire document collection and randomly perturb this vector to get k starting concept vectors, or (c) try several initial clusterings and select the best in terms of the largest objective function. In my implementation I use strategy (b).
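The iteration above, with initialization strategy (b), can be sketched compactly; the six unit vectors forming two obvious groups are invented for illustration:

```python
import numpy as np

def spherical_kmeans(X, k, iters=20, seed=0):
    """X: n x m matrix of unit-norm document vectors (rows).
    Returns unit concept vectors C (k x m) and cluster labels."""
    rng = np.random.default_rng(seed)
    # strategy (b): perturb the collection-wide concept vector to get k starts
    c0 = X.mean(axis=0)
    C = c0 + 0.01 * rng.standard_normal((k, X.shape[1]))
    C /= np.linalg.norm(C, axis=1, keepdims=True)
    for _ in range(iters):
        labels = np.argmax(X @ C.T, axis=1)      # closest in cosine similarity
        for j in range(k):
            members = X[labels == j]
            if len(members):
                m = members.sum(axis=0)           # n_j * m_j
                C[j] = m / np.linalg.norm(m)      # c_j = m_j / ||m_j|| (eq. 4.12)
    return C, labels

# Toy collection: 6 unit vectors in R^3 forming two groups (invented).
X = np.array([[1, 0, 0], [0.9, 0.1, 0], [0.95, 0.05, 0],
              [0, 0, 1], [0, 0.1, 0.9], [0.05, 0, 0.95]], dtype=float)
X /= np.linalg.norm(X, axis=1, keepdims=True)
C, labels = spherical_kmeans(X, k=2)
print(labels)
```

A production version would also track the objective (4.16) and stop when its change falls below a tolerance, as in step 3 above.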
4.2.2. FUZZY K-MEANS
The fuzzy k-means algorithm (FKM) (see [9], [26], [27], [34]) generalizes the classical or hard k-means algorithm. The goal of the k-means algorithm is to cluster n objects (here documents) into k clusters and find the k mean vectors of the clusters (centroids). In the context of the vector space model for information retrieval we call these mean vectors concepts. The spherical k-means algorithm used in [7] is just a variation of the hard k-means algorithm which uses the fact that the document vectors (and the concept vectors) are of unit norm.
The fuzzy k-means algorithm minimizes the cost function

J_{fuzz} = \sum_{i=1}^{k} \sum_{j=1}^{n} \mu_{ij}^{b} \, \| a_j - c_i \|^2,    (4.20)

where aj, j = 1, ..., n are the document vectors, ci, i = 1, ..., k are the concept vectors, μij is the fuzzy membership degree of document aj in the cluster whose concept is ci, and b is a weight exponent of the fuzzy membership. In general, the Jfuzz criterion is minimized when each concept ci is near those points that have a high fuzzy membership degree for cluster i, i = 1, ..., k.
By solving the system of equations ∂J_{fuzz}/∂c_i = 0 and ∂J_{fuzz}/∂μ_{ij} = 0, we get

\mu_{ij} = \frac{1}{\sum_{r=1}^{k} \left( \frac{\|a_j - c_i\|^2}{\|a_j - c_r\|^2} \right)^{\frac{1}{b-1}}}, \quad i = 1, \ldots, k, \ j = 1, \ldots, n,    (4.21)

c_i = \frac{\sum_{j=1}^{n} \mu_{ij}^{b} \, a_j}{\sum_{j=1}^{n} \mu_{ij}^{b}}, \quad i = 1, \ldots, k,    (4.22)
for which the cost function achieves a local minimum. We obtain the concept vectors using the following iterative procedure:
4. Compute new concept vectors c_i^{(t+1)} according to formula (4.22).
5. Compute new fuzzy membership degrees μ_{ij}^{(t+1)} according to formula (4.21).
6. Compute the new cost function value J_{fuzz}^{(t+1)} according to formula (4.20).
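The update loop (4.20)-(4.22) can be sketched as follows; the tiny 2-term, 4-document matrix is invented for illustration:

```python
import numpy as np

def fuzzy_kmeans(A, k, b=2.0, iters=30, tol=1e-9, seed=0):
    """A: m x n matrix of document column-vectors. Returns concepts C (m x k)
    and fuzzy memberships U (k x n), iterating eqs. (4.21)-(4.22)."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    U = rng.random((k, n))
    U /= U.sum(axis=0, keepdims=True)        # memberships sum to 1 per document
    cost = np.inf
    for _ in range(iters):
        W = U ** b
        C = (A @ W.T) / W.sum(axis=1)        # eq. (4.22): weighted centroids
        # squared distances ||a_j - c_i||^2, shape (k, n)
        D2 = ((A[:, None, :] - C[:, :, None]) ** 2).sum(axis=0)
        D2 = np.maximum(D2, 1e-12)           # guard against division by zero
        U = D2 ** (-1.0 / (b - 1.0))
        U /= U.sum(axis=0, keepdims=True)    # eq. (4.21)
        new_cost = float(((U ** b) * D2).sum())   # eq. (4.20)
        if abs(cost - new_cost) < tol:
            break
        cost = new_cost
    return C, U

# Toy collection: two obvious groups of documents (invented).
A = np.array([[1.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 1.0]])
C, U = fuzzy_kmeans(A, k=2)
print(U.round(3))
```

Defuzzifying with `U.argmax(axis=0)` recovers a hard clustering comparable to the spherical k-means output.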
4.3. LSI AND CI COMPARISON EXAMPLE
Example 4.1 [8] In this example LSI (SVD) and CI using fuzzy k-means (CDFKM) are compared for the document collection and queries from example 2.1. The rank-2 approximations are computed by concept decomposition (fuzzy k-means, k = 2) and truncated SVD (k = 2). Let the truncated SVD be U_2 \Sigma_2 V_2^T and the concept decomposition be C_2 Z^*. In the truncated SVD, the rows of U2 are the approximate (two-dimensional) representations of terms, while the rows of V2 are the approximate (two-dimensional) representations of documents. Here we neglect the Σ2 part, since Σ2 is a diagonal matrix and produces only a scaling of the axes. In CDFKM, the rows of C2 are approximate representations of terms and the columns of Z* are approximate representations of documents. Coordinates of terms are listed in Table 4.1, while coordinates of documents and queries are listed in Table 4.2. In Figures 4.2 and 4.3 the images of terms are plotted. From Figure 4.2 we can see that the images of the two groups of terms, data mining (DM) terms and linear algebra (LA) terms, are grouped together in the case of truncated SVD. In the case of CDFKM, the two groups of terms are generally grouped along the axes: along (and near) the y axis we have DM terms, and along the x axis we have LA terms. Exceptions are the terms information and retrieval. Our assumption is that this is because the model was confused by document D6, which contains these terms together with LA terms.
Most of the DM documents do not contain the words data and mining. Such documents will not be recognized as relevant by the simple term-matching vector space method. Document D6, relevant for Q2, does not contain any of the query's terms. In the vector space model, the query has the same form as a document. Let q be the representation of the query in the vector space model and q̃ its approximate representation using the truncated SVD.
Term            SVD xi    SVD yi    CDFKM xi    CDFKM yi
text 0.21 -0.31 0.10 0.43
mining 0.16 -0.29 0.01 0.42
clustering 0.24 -0.41 0.08 0.48
classification 0.12 -0.18 0.00 0.23
retrieval 0.27 -0.20 0.29 0.11
analysis 0.21 -0.11 0.29 0.00
information 0.09 -0.16 0.00 0.32
linear 0.25 -0.41 0.10 0.47
algebra 0.19 0.14 0.20 0.00
matrix 0.50 0.40 0.54 0.00
application 0.37 0.25 0.40 0.00
document 0.29 0.19 0.32 0.00
vector 0.29 0.20 0.32 0.00
space 0.21 -0.09 0.19 0.12
data 0.19 0.14 0.20 0.00
algorithm 0.11 -0.17 0.18 0.07
[Figure 4.2: two-dimensional images of terms (DM terms and LA terms) for the truncated SVD. Figure 4.3: two-dimensional images of terms (DM terms and LA terms) for CDFKM.]
Then, the following is satisfied:

q \approx U_2 \Sigma_2 \tilde{q}, \quad \text{that is,} \quad \tilde{q}^T = q^T U_2 \Sigma_2^{-1}.
In Figures 4.4 and 4.5, representations of documents and queries are plotted. In the
SVD projection, DM documents form one group, LA documents another, and
document D6 is isolated. In the CD projection, LA documents are grouped; DM
documents are somewhat more dispersed, while document D6 falls into the group of
LA documents. Shaded areas represent the areas of documents relevant to the
queries in the cosine similarity sense.
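The projection above can be illustrated with a minimal numpy sketch on a toy term-document matrix (the matrix and names here are illustrative, not the collection from this example):

```python
import numpy as np

# toy term-document matrix (terms x documents); names are illustrative
A = np.array([[1.0, 0.0, 1.0],
              [1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0]])
A = A / np.linalg.norm(A, axis=0)        # normalize document vectors

# rank-2 truncated SVD: A ~ U2 S2 V2^T
U, s, Vt = np.linalg.svd(A, full_matrices=False)
U2, S2 = U[:, :2], np.diag(s[:2])

# 2-D document representations: rows of V2, scaled by the singular values
docs_2d = Vt[:2, :].T @ S2               # shape (n_docs, 2)

# query projection, as in the formula above: q~^T = q^T U2 S2^(-1)
q = np.array([1.0, 0.0, 0.0])            # a query containing only the first term
q_2d = q @ U2 @ np.linalg.inv(S2)
```

Plotting the rows of docs_2d and the point q_2d reproduces the kind of 2-D pictures shown in Figures 4.2-4.5.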
The documents retrieved for query Q1, in descending order of their score under the
term-matching method, are: D15, D12, D14, D9, D11 and D1. The other documents are
not retrieved at all, since their score is 0. So, the term-matching method retrieves
6 out of 10 relevant documents. The documents retrieved for Q1 by LSI are: D1,
D11, D12, D9, D15, D2, D14, D13, D5 and D6. The scores of the other documents are
much lower, and we can state that they are not retrieved at all. The retrieved
documents are exactly the relevant documents. The documents retrieved for Q1 by
CI are: D1, D14, D12, D11, D15, D5, D13, D2 and D9; these are all the relevant
documents except D6. For query Q2, only document D6 is relevant. The
term-matching method does not retrieve it at all, the LSI method recognizes D6 as the
most relevant document (although it contains no term from the query), and the CI
method retrieves D6 as the sixth most relevant document.
As a conclusion to this academic example, we can state that the LSI and CI
methods have a similar effect: they group documents on a similar topic even if
their term profiles differ. On this example, LSI appears to work better. In
the EXPERIMENTS AND RESULTS section, we compare the two techniques on
much larger document collections to achieve statistically significant comparisons.
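The cosine-similarity scoring behind the rankings above can be sketched as follows (toy data; the helper name is ours):

```python
import numpy as np

def cosine_rank(A, q):
    """Rank documents (the columns of A) by cosine similarity to query q.

    Returns the document indices in descending order of similarity,
    together with the similarity scores."""
    sims = (q @ A) / (np.linalg.norm(A, axis=0) * np.linalg.norm(q) + 1e-12)
    return np.argsort(-sims), sims

A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [0.0, 0.0, 1.0]])
q = np.array([1.0, 1.0, 0.0])
order, sims = cosine_rank(A, q)          # document 2 shares both query terms
```

Documents whose score is 0 (no shared terms) end up at the bottom of the ranking, which is exactly why the pure term-matching method misses D6 for Q2.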
Figure 4.4 Representations of documents and queries in the truncated SVD space
Figure 4.5 Representations of documents and queries in the CD space
4.4. LSI FOLDING-IN ALGORITHM
A = [A1 A2] (4.23)
With m we denote the number of initial terms, with n1 the number of initial
documents, and with n2 the number of new documents. The matrix A1 is then of size
m x n1 and the matrix A2 of size m x n2. Furthermore, A1 ← txn(A1) and
A2 ← txn(A2) are performed; in other words, the column vectors of A1 and A2 are
normalized.
The truncated SVD is written A1k = Uk ∑k VkT. The query matrix Q is of size
q x m (queries contain only the initial terms).
This algorithm represents documents and queries in the LSI space without
introducing coordinates for new index terms and without correcting the singular
vectors. The algorithm proceeds as described below.
Algorithm: compute the truncated SVD of A1 and take the rows of Vk as the
representations of the initial documents in the k-dimensional LSI space; fold in the
new documents by projecting the columns of A2 onto this space, obtaining DNEW;
append DNEW to the initial representations to obtain DALL, the representations of
all documents; finally, project the queries Q into the same space.
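Under these assumptions, the folding-in steps can be sketched in numpy as follows (the helper name lsi_fold_in is ours, not part of the thesis code):

```python
import numpy as np

def lsi_fold_in(A1, A2, k):
    """Fold the new documents A2 into the k-dimensional LSI space of A1.

    The singular vectors of A1 are NOT recomputed; the new documents
    are merely projected onto them."""
    A1 = A1 / np.linalg.norm(A1, axis=0)     # the txn normalization step
    A2 = A2 / np.linalg.norm(A2, axis=0)
    U, s, Vt = np.linalg.svd(A1, full_matrices=False)
    Uk, Sk = U[:, :k], np.diag(s[:k])
    D_initial = Vt[:k, :].T                  # rows: initial documents
    D_new = A2.T @ Uk @ np.linalg.inv(Sk)    # rows: folded-in documents
    return Uk, Sk, np.vstack([D_initial, D_new])

rng = np.random.default_rng(0)
Uk, Sk, D_all = lsi_fold_in(rng.random((6, 4)), rng.random((6, 2)), k=2)
```

Queries are then projected with the same formula as the new documents, q̃T = qT Uk ∑k^(-1), so they can be matched against the rows of D_all by cosine similarity.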
5. SYSTEM IMPLEMENTATION
Our system implements two dimensionality reduction methods. Concept Indexing (CI)
and Latent Semantic Indexing (LSI), whereas with more emphasis on CI method.
Because of lack of time, LSI method we have developed in MATLAB programming
language, and CI method we have developed within three main components, and is
more detailed and elaborated.
The first component of the CI system is Intel's Math Kernel Library (the BLAS and
LAPACK libraries [11]). Because IR in this thesis is based on the vector space
model, we found Intel's BLAS and LAPACK libraries suitable for its implementation.
The second component is the core implementation (back-end), written in Microsoft
Visual Studio .NET C++, and the third component is a graphical user interface,
GUI (front-end), written in Microsoft Visual Studio .NET C#. Each component of the
system is described in the following paragraphs.
1) The BLAS and LAPACK libraries are time-honored standards for solving a
large variety of linear algebra problems. The Intel Math Kernel Library (Intel
MKL) contains an implementation of BLAS and LAPACK that is highly
optimized for Intel processors. Intel MKL can enable you to achieve significant
performance improvements over alternative implementations of BLAS and
LAPACK.
BLAS
Basic Linear Algebra Subroutines (BLAS) provide the basic vector and matrix
operations underlying many linear algebra problems. Intel MKL BLAS support
includes:
For BLAS Levels 2 and 3, multiple matrix storage schemes are provided. All
BLAS functions within Intel MKL are thread-safe. Parallelized (threaded) BLAS
Level 3 routines provide the performance gains of multiprocessing without
requiring changes to the application.
Sparse BLAS
Sparse BLAS is a set of functions that perform a number of common vector
operations on sparse vectors stored in compressed form. Sparse vectors are
those in which the majority of elements are zeros. Sparse BLAS routines and
functions, implemented specifically to take advantage of vector sparsity, achieve
large savings in computation time and memory.
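To illustrate the idea of compressed sparse storage (this is a sketch of the concept, not the actual MKL Sparse BLAS interface), a dot product over (index, value) pairs can be written as:

```python
def sparse_dot(ix, vx, iy, vy):
    """Dot product of two sparse vectors given as (sorted indices, values).

    Only the stored non-zero entries are visited, so the cost depends on
    the number of non-zeros, not on the full dimension."""
    i = j = 0
    acc = 0.0
    while i < len(ix) and j < len(iy):
        if ix[i] == iy[j]:
            acc += vx[i] * vy[j]
            i += 1
            j += 1
        elif ix[i] < iy[j]:
            i += 1
        else:
            j += 1
    return acc

# two sparse vectors living in a very high-dimensional term space
x_idx, x_val = [3, 17, 9000], [1.0, 2.0, 3.0]
y_idx, y_val = [17, 9000], [5.0, 0.5]
```

Document vectors in the vector space model are exactly this kind of sparse vector, which is why the Sparse BLAS routines pay off.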
LAPACK
Intel MKL includes Linear Algebra Package (LAPACK) routines that are used
for solving:
Linear equations
Eigenvalue problems
Least-squares problems
Singular value problems
LAPACK routines support both real and complex data. Routines are supported
for systems of equations with the following types of matrices: general, banded,
symmetric or Hermitian, triangular, and tridiagonal. The LAPACK routines
within Intel MKL provide multiple matrix storage schemes. LAPACK routines
are available with a FORTRAN interface.
2) The back-end is composed of two parts. The first part is the core
implementation of the concept indexing functionality using Intel's MKL library,
written in unmanaged C++. Because we decided to use C# for the GUI, it was
necessary to expose the core implementation through managed C++; the second
part of the back-end is therefore a wrapper class, written in managed C++,
which mediates between the core back-end and the GUI.
3) The front-end is written in the C# programming language, which we found very
suitable for visualization. The GUI contains three different tab windows:
Single test, Batch testing and Query testing.
In the Single test window (see Figure 5.1) we can perform the CI algorithm step by
step or run it to the end, with options for adjusting many algorithm parameters, such
as the clustering algorithm type (spherical or fuzzy k-means), the number of concept
vectors, the type of concept vector initialization (random values, random documents
or perturbed centroids) and the initial percentage of documents to select from the
document collection. We have also implemented a 2D example mode, in which we
can perform CI on two types of examples.
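The spherical k-means variant selectable here can be sketched roughly as follows (a minimal numpy version with the "random documents" initialization and no fuzzy weights; a simplified sketch, not the system's C++ implementation):

```python
import numpy as np

def spherical_kmeans(X, k, iters=20, seed=0):
    """Minimal spherical k-means: documents are the unit columns of X.

    Concept vectors are the normalized centroids of the clusters; each
    document is assigned to the concept vector with maximal cosine
    similarity."""
    X = X / np.linalg.norm(X, axis=0)
    rng = np.random.default_rng(seed)
    # "random documents" initialization: pick k documents as initial concepts
    C = X[:, rng.choice(X.shape[1], size=k, replace=False)]
    for _ in range(iters):
        assign = np.argmax(C.T @ X, axis=0)          # cosine similarity
        for j in range(k):
            members = X[:, assign == j]
            if members.shape[1]:                     # keep old center if empty
                c = members.sum(axis=1)
                C[:, j] = c / np.linalg.norm(c)
    return C, assign

X = np.array([[1.0, 0.9, 0.0, 0.0],
              [0.0, 0.1, 1.0, 0.95]])
C, assign = spherical_kmeans(X, k=2)
```

The columns of C are the concept vectors; stacking them gives the concept matrix used by the concept decomposition.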
In the first example (example type-1), on the graph on the right side of this
window, we can define documents in the positive quadrant by clicking at points of
the graph (Figure 5.2), and then cluster those documents by running one of the
clustering algorithms (Figure 5.3). Blue crosses represent documents and red
crosses represent concept vectors.
In the second example (example type-2), on the graph on the right side of this
window, we can perform clustering (CD and SVD) on the document collection
defined in Example 2.1 in section 2.2; in other words, Example 4.1 from section
4.2 can be re-run. Figure 5.4 shows the representations of documents and queries
in the CI space (a different example of the clustering presented in Figure 4.5).
Figure 5.5 depicts the representations of terms in the CI space (a different
example of the clustering presented in Figure 4.3).
Figure 5.6 depicts the representations of documents and queries in the LSI space
(a different example of the clustering presented in Figure 4.4).
Figure 5.6 Document and query representations in LSI space (example type-2)
Figure 5.7 depicts the representations of terms in the LSI space (a different
example of the clustering presented in Figure 4.2).
The Single test tab window also shows algorithm information such as document
and term counts, memory consumption and algorithm outputs.
In the Options tab window we can adjust many parameters for batch runs: the
number of repeats for a test, the range of the number of concept vectors (or a
fixed number), the clustering algorithm (spherical, fuzzy k-means or both), the
type of concept vector initialization (random, random documents or perturbed
centroids), the minimum and maximum term frequencies, the range of the initial
documents percentage (or a fixed percentage) and the type of query matching
condition. After adjusting the parameters for batch testing, we select Start Batch
in the Batch menu. Batch testing can be paused or stopped at any time, because
it is implemented as a separate thread.
Batch results are presented in the Test list and results tab window. Each test
is represented as one record (row) of the table. The columns of the table represent
test attributes such as the number of repeats, the clustering algorithm (spherical
or fuzzy k-means), the type of concept vector initialization, the number of concept
vectors, the number of iterations for CI, the average concept vector dot product,
the decomposition error, the mean average precision (MAP) and F1 measure, and
the algorithm running time.
Figure 5.9 Batch test tab-window (Test list and results) screenshot
In the Graph tab window we can define parameters for visualizing batch tests.
We can define the quantities shown on the x and y axes. On the x axis we can
plot the number of concept vectors, the initial document percentage or the fuzzy
exponent b. We can define two y axes and observe the following measures:
average concept vector dot product, decomposition error, number of iterations,
test duration, memory consumption, recall (min, max and avg), precision (min, max
and avg), MAP (mean average precision) and F1 measure (min, max and avg).
Four types of graphs can be defined: points for each repeat in the test, averages
over test repeats, medians over test repeats, and the test's mean and variation
values. Colors and labels can also be defined for each graph.
Figure 5.10 Batch test tab-window (Graph) screenshot
The Query testing tab window (Figure 5.11) enables query testing and analysis.
Queries are analyzed through four tables and a graph visualization. In the first
table, each row represents one query (query id, F1, MAP, recall, precision, number
of relevant documents, number of retrieved documents and maximum
query-document similarity). The content of the three other tables changes
dynamically, depending on which row (query) is selected in the first table. The
second table contains the terms of the selected query, the third table the ids of its
relevant documents, and the last table the ids and similarities of the retrieved
documents. There are three types of query matching: similarity limit (retrieve all
documents with similarity greater than or equal to the limit), fixed number n of
retrieved documents (retrieve the n documents with highest similarity) and
proportion limit (the number of retrieved documents equals the proportion limit
times the number of relevant documents). We can draw recall-precision graphs for
each query separately or averaged over all queries, as well as the distributions of
relevant documents for the queries.
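The three query matching conditions can be sketched as follows (function and parameter names are ours, not the system's API):

```python
import numpy as np

def retrieve(sims, mode, limit=None, n=None, proportion=None, n_relevant=None):
    """Select retrieved documents from the similarity scores `sims`.

    mode = "limit":      all documents with similarity >= limit
    mode = "fixed":      the n documents with highest similarity
    mode = "proportion": round(proportion * n_relevant) best documents"""
    order = np.argsort(-sims)                    # best documents first
    if mode == "limit":
        return [i for i in order if sims[i] >= limit]
    if mode == "fixed":
        return list(order[:n])
    if mode == "proportion":
        return list(order[:int(round(proportion * n_relevant))])
    raise ValueError(mode)

sims = np.array([0.9, 0.1, 0.5])                 # one score per document
```

For example, a similarity limit of 0.4 retrieves documents 0 and 2, while a fixed n of 2 retrieves the same pair.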
Figure 5.11 Query testing tab-window screenshot
6. EXPERIMENTS AND RESULTS
In this section, the results of different experiments are presented. The experiments
are based on a comparison of the two dimensionality reduction methods, latent
semantic indexing and concept indexing (with spherical and fuzzy k-means
clustering), with respect to several parameters.
Experiments are also performed for the defined folding-in method for LSI (only for
the MEDLINE document collection).
Finally, memory consumption and running time experiments are performed and
their results presented.
6.1. DECOMPOSITION ERROR
Figure 6.1 Relation between the number of concept/singular vectors and the
decomposition error for MEDLINE
Figure 6.2 depicts the relation between the decomposition error and the
number of concept/singular vectors for the CRANFIELD document collection. This
graph confirms that SVD approximates the document matrix better than CD on the
CRANFIELD collection as well, and also that CD using fuzzy k-means clustering is
slightly better than CD using spherical k-means clustering on this collection.
Figure 6.2 Relation between the number of concept/singular vectors and the
decomposition error for CRANFIELD
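Taking the decomposition error to be the Frobenius norm of the difference between the document matrix and its approximation, the comparison in these figures can be sketched in numpy (with a random matrix and arbitrary columns standing in for concept vectors; all names are illustrative):

```python
import numpy as np

def svd_error(A, k):
    """Frobenius norm of A minus its rank-k truncated SVD."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    Ak = (U[:, :k] * s[:k]) @ Vt[:k, :]
    return np.linalg.norm(A - Ak)

def cd_error(A, C):
    """Frobenius norm of A minus its concept decomposition C Z*,
    where Z* = argmin_Z ||A - C Z||_F (a least-squares problem)."""
    Z, *_ = np.linalg.lstsq(C, A, rcond=None)
    return np.linalg.norm(A - C @ Z)

rng = np.random.default_rng(1)
A = rng.random((20, 10))
k = 3
C = A[:, :k]            # k arbitrary columns standing in for concept vectors
e_svd, e_cd = svd_error(A, k), cd_error(A, C)
```

By the Eckart-Young theorem the rank-k SVD is the best rank-k approximation, so e_svd can never exceed e_cd, which is exactly what the figures show.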
6.2. ORTHONORMALITY OF CONCEPT VECTORS
This experiment shows that concept vectors tend towards orthonormality.
Figure 6.3 depicts the average dot product of the concept vectors as a function of
their number, for the MEDLINE document collection. As the number of concept
vectors increases, their average dot product tends towards 0; that is, the vectors
tend towards orthonormality. On the MEDLINE collection, CD using fuzzy k-means
clustering shows a stronger tendency towards orthonormality than CD using
spherical k-means clustering.
Figure 6.3 Relation between the number of concept vectors and the average concept
vectors dot product for MEDLINE
Unlike CD on MEDLINE, Figure 6.4 shows no great difference in the concept
vectors' tendency towards orthonormality between CD using fuzzy and spherical
k-means clustering on the CRANFIELD document collection. Still, the concept
vectors tend towards orthonormality, although less strongly than on the MEDLINE
collection. This is probably because MEDLINE has fewer documents than
CRANFIELD (1033 compared to 1398), but almost twice as many index terms
(7014 compared to 3763). With a greater number of index terms it is easier to
achieve orthonormality, because the document vectors are very sparse.
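One plausible way to compute the average concept vector dot product plotted here is to average the absolute off-diagonal entries of the Gram matrix (the thesis may define the measure slightly differently; this is our sketch):

```python
import numpy as np

def avg_concept_dot(C):
    """Average absolute pairwise dot product of unit concept vectors.

    Values near 0 indicate the concept vectors are close to orthonormal
    (they are already unit length, so only pairwise orthogonality matters)."""
    C = C / np.linalg.norm(C, axis=0)
    G = C.T @ C                     # Gram matrix of the concept vectors
    k = C.shape[1]
    off_diag = np.abs(G).sum() - k  # remove the k unit diagonal entries
    return off_diag / (k * (k - 1))
```

Perfectly orthonormal concept vectors (for example the identity matrix) give 0, while identical concept vectors give 1.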
Figure 6.4 Relation between the number of concept vectors and the average concept
vectors dot product for CRANFIELD
6.3. CI PERFORMANCES
Figure 6.5 shows that CI using fuzzy clustering generally needs more iterations
than CI using spherical clustering on the MEDLINE document collection.
Figure 6.5 Relation between the number of concept vectors and the number of
iterations for MEDLINE
Figure 6.6 corroborates that CI using fuzzy clustering generally needs more
iterations than CI using spherical clustering on the CRANFIELD set as well.
Figure 6.6 Relation between the number of concept vectors and the number of
iterations for CRANFIELD
Tests performed with CI using fuzzy k-means clustering take longer than tests
with CI using spherical clustering, because of the computation of the fuzzy
weights. Figure 6.7 shows the test duration for MEDLINE and Figure 6.8 for
CRANFIELD.
Figure 6.7 Relation between the number of concept vectors and the test duration (in
seconds) for MEDLINE
Figure 6.8 Relation between the number of concept vectors and the test duration (in
seconds) for CRANFIELD
Figure 6.9 shows that memory savings are achieved only for fewer than 25
concept vectors (for MEDLINE). We can also see that the memory consumption of
CI grows linearly and that CI using fuzzy clustering generally needs more memory
than CI using spherical clustering. The fuzzy weight matrix (number of documents
times number of concept vectors) is an additional structure in CI using fuzzy
clustering, which is why it uses more memory.
Figure 6.9 Relation between number of concept vectors and memory consumption
for MEDLINE
Figure 6.10 shows CI’s memory consumption for CRANFIELD.
Figure 6.10 Relation between number of concept vectors and memory consumption
for CRANFIELD
We can conclude that, while running, CI using fuzzy clustering needs more
iterations, more time and more memory than CI using spherical clustering.
6.4. INFORMATION RETRIEVAL EVALUATION
6.4.1. WITHOUT FOLDING-IN DOCUMENTS
Figure 6.11 shows the average recall over all queries as a function of the number
of concept/singular vectors for the MEDLINE document collection. Recall for LSI is
lower than recall for CI. Furthermore, regardless of which clustering algorithm is
used in CI, very similar recall is obtained.
Figure 6.11 Relation between the number of concept/singular vectors and the
average recall of all queries for MEDLINE
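The evaluation measures reported throughout this section (recall, precision, F1, and MAP as the mean of per-query average precision) can be sketched as:

```python
def recall_precision_f1(retrieved, relevant):
    """Recall, precision and F1 for one query (collections of document ids)."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    recall = hits / len(relevant) if relevant else 0.0
    precision = hits / len(retrieved) if retrieved else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return recall, precision, f1

def average_precision(ranked, relevant):
    """Average precision for one ranked list; MAP is its mean over queries."""
    relevant = set(relevant)
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank       # precision at this recall point
    return total / len(relevant) if relevant else 0.0
```

For example, retrieving documents {1, 2, 3} when {2, 3, 4} are relevant gives recall, precision and F1 of 2/3 each.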
Figure 6.12 depicts the average recall over all queries as a function of the number
of concept/singular vectors for the CRANFIELD document collection. Unlike
MEDLINE, CRANFIELD shows the opposite behaviour: recall is greater for LSI
than for CI. As on the MEDLINE collection, the spherical and fuzzy k-means
clustering algorithms give similar recall.
Figure 6.12 Relation between the number of concept/singular vectors and the
average recall of all queries for CRANFIELD
Moreover, the relation between precision and the number of concept/singular
vectors is shown in Figure 6.13 for MEDLINE and in Figure 6.14 for CRANFIELD.
In Figure 6.13 we see that precision for LSI is greater than for CI. So far, LSI
has greater precision than CI but lower recall on MEDLINE. Regarding precision,
spherical k-means gives slightly better results than fuzzy k-means clustering in CI.
Figure 6.13 Relation between the number of concept/singular vectors and the
average precision of all queries for MEDLINE
On the CRANFIELD collection we see that both precision and recall are greater
for LSI than for CI. There is almost no difference between spherical and fuzzy
k-means clustering in CI.
Figure 6.14 Relation between the number of concept/singular vectors and the
average precision of all queries for CRANFIELD
Figure 6.15 shows query-document matching in the original space and in the
reduced dimensionality spaces (LSI and CI). Generally, LSI gives greater MAP
than CI, but both LSI and CI (with spherical and fuzzy k-means clustering) reach
their maximum MAP at 50 concept vectors. We can assume that the MEDLINE
document collection divides naturally into 50 clusters. Furthermore, CI using fuzzy
k-means clustering gives greater MAP than CI using spherical clustering at the
natural number of clusters (50), but for more clusters it is the other way around.
The graph also shows that better results are achieved in the reduced space than
in the original space, even though information is discarded by the reduction. This
is because LSI and CI resolve the problems of synonymy and polysemy,
redundancy and noise.
Figure 6.15 Relation between the number of concept/singular vectors and the
average MAP of all queries for MEDLINE
Unlike for MEDLINE, the results of the experiments performed on CRANFIELD
(see Figure 6.16) are different. Better MAP results are achieved in the original
than in the reduced dimensionality space. Also, the LSI MAP graph grows
monotonically; in other words, it does not reach a maximum. The CI MAP graphs
start to saturate at 100 concept vectors. The MAP graph for spherical CI almost
completely matches the MAP graph for fuzzy CI.
Figure 6.16 Relation between the number of concept/singular vectors and the
average MAP of all queries for CRANFIELD
Figure 6.17 shows much better results for LSI than for CI with regard to the F1
measure. We can also see that LSI gives better results than query-document
matching in the original space. Furthermore, CI using spherical k-means gives a
slightly better F1 measure than CI using fuzzy clustering.
Figure 6.17 Relation between the number of concept/singular vectors and the
average F1 of all queries for MEDLINE
On CRANFIELD, the CI method also gives lower F1 values than LSI (see
Figure 6.18). The F1 value of query-document matching in the original space is
reached by the LSI method at 100 singular vectors.
Figure 6.18 Relation between the number of concept/singular vectors and the
average F1 of all queries for CRANFIELD
In Figure 6.19 we see the recall-precision graph for CI (fuzzy clustering) with
50 concept vectors on the MEDLINE document collection.
Figure 6.20 Recall-precision graph for CRANFIELD
6.4.2. WITH FOLDING-IN DOCUMENTS
In this section, the folding-in method for LSI is tested, only on the MEDLINE
document collection. Tests are performed starting with 10% and finishing with
100% of the initial documents, in steps of 10%.
Figure 6.21 shows that recall grows with the initial documents percentage. We can
also see that maximum recall is reached when the initial documents make up 80%
of the collection. This is probably because few or no new terms occur between
80% and 100% of the initial documents.
Figure 6.21 Relation between the initial document percentage and recall
Precision graphs (Figure 6.22) have the same characteristics as recall graphs.
Figure 6.22 Relation between the initial document percentage and precision
Figure 6.23 shows the MAP measure.
Figure 6.23 Relation between the initial document percentage and MAP
Finally, Figure 6.24 depicts the F1 measure.
7. RELATED WORKS
Dhillon and Modha in their work [7] compare concept decomposition (CD), which
they developed, to singular value decomposition (SVD) in terms of how closely the
matrix approximation matches the original document matrix. However, they do not
investigate the suitability of concept-decomposition document representations for
text mining tasks such as information retrieval or text classification.
Kogan and co-workers in [14] optimize k-means by combining the batch and
incremental k-means clustering algorithms. They use a distance-like function that
combines Euclidean distance and relative entropy.
8. CONCLUSIONS
In this thesis, the LSI and CI dimensionality reduction methods are experimentally
compared.
Our experiments have shown that concept vectors tend towards orthonormality.
Concept vectors are local and have well-defined semantics, while singular vectors
are global and cannot be interpreted. Concept vectors are also very sparse (often
more than 85%). Regarding approximation error, LSI gives the best approximation
of the document matrix in the reduced dimensionality space, although CI's
approximation error is comparable to LSI's.
We have also compared the clustering algorithms (fuzzy and spherical) for CI. CI
using fuzzy k-means clustering gives slightly better results in IR evaluation than CI
using spherical clustering, but the spherical clustering algorithm is faster and
needs less memory than the fuzzy clustering algorithm. Better IR evaluation
results are obtained for MEDLINE than for CRANFIELD, with both dimensionality
reduction methods (CI and LSI).
From the results of the experiments we have performed, it can be concluded that
LSI and CI are comparable with regard to IR.
9. REFERENCES
[4] I.S. Dhillon, J. Fan, Y. Guan, Efficient Clustering of Very Large Document
Collections, Data Mining for Scientific and Engineering Applications, Kluwer
Academic Publishers, pp. 357–381, 2001.
[5] I.S. Dhillon, Y. Guan and J. Kogan, Refining Clusters in High Dimensional Text
Data, 2nd SIAM International Conference on Data Mining (Workshop on
Clustering High-Dimensional Data and its Applications), April 2002.
[7] I.S. Dhillon, D.S. Modha, Concept decomposition for large sparse text data
using clustering, Machine Learning, Vol. 42, No. 1, pp. 143–175, 2001.
[9] J. Dobša, B. Dalbelo Bašić, Concept decomposition by fuzzy k-means
algorithm, Proceedings of IEEE / WIC International Conference on Web
Intelligence, pp. 684-688, Halifax, Canada, 2003.
[12] A.K. Jain, M.N. Murty, P.J. Flynn, Data Clustering: A Review, ACM Computing
Surveys, Vol 31, No. 3, pp. 264-323, 1999.
[15] T.A. Letsche, M.W. Berry, Large-scale Information Retrieval with Latent
Semantic Indexing, Information Sciences - Applications, 1997.
[20] Latent Semantic Indexing Web Site, http://www.cs.utk.edu/~lsi/, [19.6.2006].
[25] T.G. Kolda, D.P. O’Leary, A semi-discrete matrix decomposition for latent
semantic indexing in information retrieval, ACM Transactions on Information
Systems, Vol. 16, pp. 322–346, 1998.
[26] J.C. Bezdek, A convergence theorem for the fuzzy ISODATA clustering
algorithms, IEEE Transactions on Pattern Analysis and Machine Intelligence,
Vol. 2, No. 1, pp. 1–8, 1980.
[27] J.C. Bezdek, R.J. Hathaway, Convergence theory for fuzzy c-means:
Counterexamples and repairs, IEEE Trans. Systems, Man, Cybernetics, Vol.
17, No. 5, pp. 873–877, 1987.
[31] M.E. Maron, J.L. Kuhns, On relevance, probabilistic indexing and information
retrieval, Association for Computing Machinery 7 (1960)3, pp. 216-244.
[33] L.D. Baker, A.K. McCallum, Distributional clustering of words for text
categorisation, Proceedings of SIGIR-98, 21st ACM International Conference
on Research and Development in Information Retrieval, pp. 96-103,
Melbourne, Australia, 1998.
[34] J. Yen, R. Langari, Fuzzy Logic: Intelligence, Control and Information, Prentice
Hall, New Jersey, 1999.
[36] H. Park, M. Jeon, J. Ben Rosen, Lower dimensional representation of text data
based on centroids and least squares, BIT, 43 (2003) 3, pp. 1-22.
[37] C. Park, H. Park, Nonlinear feature extraction based on centroids and kernel
functions, Pattern Recognition 37 (2004)4, pp. 801-810.
TABLES INDEX
FIGURES INDEX
Figure 6.1 Relation between the number of concept/singular vectors and the decomposition
error for MEDLINE ................................................................................................................. 57
Figure 6.2 Relation between the number of concept/singular vectors and the decomposition
error for CRANFIELD ............................................................................................................. 58
Figure 6.3 Relation between the number of concept vectors and the average concept vectors
dot product for MEDLINE ....................................................................................................... 59
Figure 6.4 Relation between the number of concept vectors and the average concept vectors
dot product for CRANFIELD ................................................................................................... 60
Figure 6.5 Relation between the number of concept vectors and the number of iterations for
MEDLINE................................................................................................................................. 61
Figure 6.6 Relation between the number of concept vectors and the number of iterations for
CRANFIELD ............................................................................................................................ 62
Figure 6.7 Relation between the number of concept vectors and the test duration (in seconds)
for MEDLINE ........................................................................................................................... 63
Figure 6.8 Relation between the number of concept vectors and the test duration (in seconds)
for CRANFIELD ....................................................................................................................... 64
Figure 6.9 Relation between number of concept vectors and memory consumption for
MEDLINE................................................................................................................................. 65
Figure 6.10 Relation between number of concept vectors and memory consumption for
CRANFIELD ............................................................................................................................ 66
Figure 6.11 Relation between the number of concept/singular vectors and the average recall
of all queries for MEDLINE ..................................................................................................... 68
Figure 6.12 Relation between the number of concept/singular vectors and the average recall
of all queries for CRANFIELD................................................................................................. 69
Figure 6.13 Relation between the number of concept/singular vectors and the average
precision of all queries for MEDLINE ..................................................................................... 70
Figure 6.14 Relation between the number of concept/singular vectors and the average
precision of all queries for CRANFIELD ................................................................................. 71
Figure 6.15 Relation between the number of concept/singular vectors and the average MAP
of all queries for MEDLINE ..................................................................................................... 72
Figure 6.16 Relation between the number of concept/singular vectors and the average MAP
of all queries for CRANFIELD................................................................................................. 73
Figure 6.17 Relation between the number of concept/singular vectors and the average F1 of
all queries for MEDLINE ......................................................................................................... 74
Figure 6.18 Relation between the number of concept/singular vectors and the average F1 of
all queries for CRANFIELD ..................................................................................................... 75
Figure 6.19 Recall-precision graph for MEDLINE................................................................. 76
Figure 6.20 Recall-precision graph for CRANFIELD .......................................................... 77
Figure 6.21 Relation between the initial document percentage and recall ............................. 78
Figure 6.22 Relation between the initial document percentage and precision ....................... 79
Figure 6.23 Relation between the initial document percentage and MAP .............................. 80
Figure 6.24 Relation between the initial document percentage and F1 .................................. 81