Document Classification
[Figure: classifying a test document containing words such as "planning", "language", "proof", "intelligence" into classes such as Professional documents (Beruflich Dokumente) and Culture documents (Kultur Dokumente); example word counts: great 2, love 2, recommend 1, laugh 1, happy 1]
Introduction to Information Retrieval Sec.13.6
Evaluating Categorization
Evaluation must be done on test data that are
independent of the training data
Sometimes use cross-validation (averaging results
over multiple training and test splits of the overall
data)
Easy to get good performance on a test set
that was available to the learner during
training (e.g., just memorize the test set)
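A minimal Python sketch of such cross-validation, assuming generic train and evaluate functions (both hypothetical names):

```python
import random

def cross_validate(docs, labels, train, evaluate, k=5, seed=0):
    """Average evaluation scores over k train/test splits of the data."""
    indices = list(range(len(docs)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]        # k roughly equal folds
    scores = []
    for i in range(k):
        test_idx = set(folds[i])
        train_docs = [docs[j] for j in indices if j not in test_idx]
        train_labels = [labels[j] for j in indices if j not in test_idx]
        model = train(train_docs, train_labels)
        scores.append(evaluate(model,
                               [docs[j] for j in folds[i]],
                               [labels[j] for j in folds[i]]))
    return sum(scores) / k                            # average over the k splits
```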
Introduction to Information Retrieval Sec.13.6
Evaluating Categorization
Measures: precision, recall, F1, classification
accuracy
Classification accuracy: r/n where n is the
total number of test docs and r is the number
of test docs correctly classified
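A small Python sketch of these measures for one target class, computed from gold and predicted labels (names are illustrative):

```python
def evaluate_class(gold, predicted, target):
    """Precision, recall, F1 for one class, plus overall classification accuracy."""
    tp = sum(1 for g, p in zip(gold, predicted) if p == target and g == target)
    fp = sum(1 for g, p in zip(gold, predicted) if p == target and g != target)
    fn = sum(1 for g, p in zip(gold, predicted) if p != target and g == target)
    r = sum(1 for g, p in zip(gold, predicted) if g == p)    # correctly classified docs
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = r / len(gold)                                  # accuracy = r / n
    return precision, recall, f1, accuracy
```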
Naïve Bayes Classifier
Bayesian classification is a supervised learning method as well as a statistical method for classification. It assumes an underlying probabilistic model, which allows us to capture uncertainty about the model in a principled way by determining probabilities of the outcomes. It can solve both diagnostic and predictive problems.
This classifier is named after Thomas Bayes, who proposed Bayes' Theorem.
Bayesian classification provides practical learning algorithms in which prior knowledge and observed data can be combined.
Let's say we have data on 1,000 pieces of fruit. Each fruit is a Banana, an Orange, or some Other fruit, and we know 3 features of each fruit: whether it is long or not, sweet or not, and yellow or not, as displayed in the table below.
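The table itself is not reproduced here; the sketch below shows how Naïve Bayes would use such a table, with hypothetical counts (all numbers are made up for illustration):

```python
# Hypothetical counts per class: (total, long, sweet, yellow) out of 1000 fruits
counts = {"Banana": (500, 400, 350, 450),
          "Orange": (300, 0, 150, 300),
          "Other":  (200, 100, 150, 50)}
total = sum(c[0] for c in counts.values())

def posteriors(long_, sweet, yellow):
    """P(class | features), up to normalization, under the naive independence assumption."""
    observed = {"long": long_, "sweet": sweet, "yellow": yellow}
    scores = {}
    for cls, (n, n_long, n_sweet, n_yellow) in counts.items():
        p = n / total                      # prior P(class)
        cond = {"long": n_long / n, "sweet": n_sweet / n, "yellow": n_yellow / n}
        for feature, present in observed.items():
            p *= cond[feature] if present else (1 - cond[feature])
        scores[cls] = p
    return scores

print(posteriors(long_=True, sweet=True, yellow=True))   # Banana gets the highest score
```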
An Example of Text Classification with Naïve Bayes
Vector Space Classification
Support vector machines and Machine
learning on documents.
Flat and Hierarchical Clustering
Document clustering
Motivations
Document representations
Success criteria
Clustering algorithms
Partitional
Hierarchical
Ch. 16
What is clustering?
Clustering: the process of grouping a set of
objects into classes of similar objects
Documents within a cluster should be similar.
Documents from different clusters should be
dissimilar.
The commonest form of unsupervised learning
Unsupervised learning = learning from raw data, as
opposed to supervised data where a classification of
examples is given
A common and important task that finds many
applications in IR and other places
Ch. 16
How would you design an algorithm for finding the three clusters in this case?
Sec. 16.1
Applications of clustering in IR
Whole corpus analysis/navigation
Better user interface: search without typing
For improving recall in search applications
Better search results (like pseudo RF)
For better navigation of search results
Effective user recall will be higher
For speeding up vector space retrieval
Cluster-based retrieval gives faster search
Sec. 16.1
K-Means
Assumes documents are real-valued vectors.
Clusters based on centroids (aka the center of
gravity or mean) of points in a cluster, c:
$\vec{\mu}(c) = \frac{1}{|c|} \sum_{\vec{x} \in c} \vec{x}$
K-Means Algorithm
Select K random docs {s1, s2, …, sK} as seeds.
Until clustering converges (or other stopping criterion):
  For each doc di:
    Assign di to the cluster cj such that dist(xi, sj) is minimal.
  (Next, update the seeds to the centroid of each cluster)
  For each cluster cj:
    sj = μ(cj)
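A compact NumPy sketch of this algorithm, assuming documents are already real-valued vectors and using Euclidean distance (names are illustrative):

```python
import numpy as np

def kmeans(X, K, iters=100, seed=0):
    """X: (N, M) array of document vectors. Returns (assignments, centroids)."""
    rng = np.random.default_rng(seed)
    seeds = X[rng.choice(len(X), K, replace=False)]       # K random docs as seeds
    for _ in range(iters):
        # Assign each doc to the cluster whose seed/centroid is closest
        dists = np.linalg.norm(X[:, None, :] - seeds[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # Update each seed to the centroid of its cluster
        new_seeds = np.array([X[assign == j].mean(axis=0) if np.any(assign == j)
                              else seeds[j] for j in range(K)])
        if np.allclose(new_seeds, seeds):                 # converged
            break
        seeds = new_seeds
    return assign, seeds
```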
Sec. 16.4
K-Means Example (K = 2)
[Figure: pick seeds → reassign clusters → compute centroids → reassign clusters → compute centroids → reassign clusters → converged]
Sec. 16.4
Time Complexity
Computing distance between two docs is
O(M) where M is the dimensionality of the
vectors.
Reassigning clusters: O(KN) distance
computations, or O(KNM).
Computing centroids: Each doc gets added
once to some centroid: O(NM).
Assume these two steps are each done once
for I iterations: O(IKNM).
Sec. 16.4
Hierarchical Clustering
Build a tree-based hierarchical taxonomy
(dendrogram) from a set of documents.
[Dendrogram example: animal → vertebrate, invertebrate, …]
Clustering obtained by cutting the dendrogram at a desired level: each connected component forms a cluster.
Sec. 17.1
Note: the resulting clusters are still hard and induce a partition
Sec. 17.2
Complete Link
Use minimum similarity of pairs:
$\mathrm{sim}(c_i, c_j) = \min_{x \in c_i,\, y \in c_j} \mathrm{sim}(x, y)$
Makes tighter, spherical clusters that are typically
preferable.
After merging $c_i$ and $c_j$, the similarity of the resulting cluster to another cluster, $c_k$, is:
$\mathrm{sim}((c_i \cup c_j), c_k) = \min\big(\mathrm{sim}(c_i, c_k), \mathrm{sim}(c_j, c_k)\big)$
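A naive O(N³) Python sketch of complete-link agglomerative clustering using this min-similarity rule, with cosine similarity assumed as the base similarity:

```python
import numpy as np

def complete_link_hac(X, num_clusters):
    """X: (N, M) array. Merge until num_clusters remain; complete link = min pair similarity."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    sim = Xn @ Xn.T                                   # pairwise cosine similarities
    clusters = [[i] for i in range(len(X))]
    while len(clusters) > num_clusters:
        best, pair = -np.inf, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # complete link: similarity of two clusters = minimum pair similarity
                s = min(sim[i, j] for i in clusters[a] for j in clusters[b])
                if s > best:
                    best, pair = s, (a, b)
        a, b = pair
        clusters[a] += clusters[b]                    # merge the most similar pair
        del clusters[b]
    return clusters
```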
Computational Complexity
In the first iteration, all HAC methods need to compute the similarity of all pairs of N initial instances, which is O(N²).
In each of the subsequent N − 2 merging iterations, compute the distance between the most recently created cluster and all other existing clusters.
In order to maintain an overall O(N²) performance, computing the similarity to each other cluster must be done in constant time.
Often O(N³) if done naively, or O(N² log N) if done more cleverly.
Sec. 16.3
Purity example

                                     Same cluster   Different clusters
Same class in ground truth                20               24
Different classes in ground truth         20               72
Sec. 16.3
$RI = \frac{A + D}{A + B + C + D}$
Compare with standard Precision and Recall:
$P = \frac{A}{A + B} \qquad R = \frac{A}{A + C}$
People also define and use a cluster F-measure, which is probably a better measure.
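A quick Python check of these formulas on the pair counts from the example table above (reading A = 20, B = 20, C = 24, D = 72):

```python
# Pair counts read off the table above:
A = 20   # same class, same cluster              (true positives)
B = 20   # different classes, same cluster       (false positives)
C = 24   # same class, different clusters        (false negatives)
D = 72   # different classes, different clusters (true negatives)

RI = (A + D) / (A + B + C + D)     # Rand Index ≈ 0.68
P = A / (A + B)                    # pairwise precision = 0.5
R = A / (A + C)                    # pairwise recall ≈ 0.45
F1 = 2 * P * R / (P + R)           # a cluster F-measure ≈ 0.48
print(RI, P, R, F1)
```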
Introduction to Information Retrieval Sec. 18.1
Matrix-vector multiplication
Thus a matrix-vector multiplication such as Sx
(S, x as in the previous slide) can be rewritten
in terms of the eigenvalues/vectors:
Matrix-vector multiplication
Suggestion: the effect of small eigenvalues is
small.
If we ignored the smallest eigenvalue (1), then
instead of
we would get
Example
Let S be real and symmetric. Then:
Eigen/diagonal Decomposition
Let S be a square matrix with m linearly independent eigenvectors (a non-defective matrix).
Theorem: there exists an eigendecomposition S = UΛU⁻¹, with Λ diagonal (cf. the matrix diagonalization theorem).
The decomposition is unique for distinct eigenvalues.
Introduction to Information Retrieval Sec. 18.1
Recall
Then, S = UΛU⁻¹ =
Introduction to Information Retrieval Sec. 18.1
Example continued
Let's divide U (and multiply U⁻¹) by …
Then, S = QΛQᵀ, where Q is orthonormal (Q⁻¹ = Qᵀ).
Exercise
Examine the symmetric eigen decomposition,
if any, for each of the following matrices:
Similarity Clustering
We can compute the similarity between two document vector representations $x_i$ and $x_j$ by $x_i x_j^{\mathsf T}$
Let $X = [x_1 \ldots x_N]$
Then $XX^{\mathsf T}$ is a matrix of similarities
$XX^{\mathsf T}$ is symmetric
So $XX^{\mathsf T} = Q \Lambda Q^{\mathsf T}$
So we can decompose this similarity space into a set of orthonormal basis vectors (given in Q) scaled by the eigenvalues in $\Lambda$
This leads to PCA (Principal Components Analysis)
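A short NumPy sketch of this decomposition on a toy matrix of random document vectors (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((5, 8))                      # 5 toy document vectors in 8 dimensions

S = X @ X.T                                 # matrix of pairwise similarities (symmetric)
eigvals, Q = np.linalg.eigh(S)              # S = Q diag(eigvals) Q^T, Q orthonormal

assert np.allclose(S, Q @ np.diag(eigvals) @ Q.T)   # reconstruction check
```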
Introduction to Information Retrieval Sec. 18.2
SVD: A = UΣVᵀ, where U is M×M, Σ is M×N, and V is N×N.
AAᵀ = QΛQᵀ
AAᵀ = (UΣVᵀ)(UΣVᵀ)ᵀ = (UΣVᵀ)(VΣUᵀ) = UΣ²Uᵀ
The columns of U are orthogonal eigenvectors of AAᵀ.
The columns of V are orthogonal eigenvectors of AᵀA.
The eigenvalues λ₁ … λᵣ of AAᵀ are the eigenvalues of AᵀA.
Singular values: σᵢ = √λᵢ.
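These relationships are easy to verify numerically; a NumPy sketch on a random toy matrix (illustrative only):

```python
import numpy as np

A = np.random.default_rng(1).random((4, 6))        # toy M x N term-document matrix
U, s, Vt = np.linalg.svd(A, full_matrices=False)   # A = U diag(s) Vt

eigvals = np.linalg.eigvalsh(A @ A.T)[::-1]        # eigenvalues of AA^T, descending
assert np.allclose(eigvals, s ** 2)                # equal to the squared singular values
```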
Introduction to Information Retrieval Sec. 18.2
SVD example
Let
Reduced SVD
What it is
From term-doc matrix A, we compute the
approximation Ak.
There is a row for each term and a column
for each doc in Ak
Thus docs live in a space of k<<r
dimensions
These dimensions are not the original axes
But why?
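A NumPy sketch of computing the rank-k approximation A_k by keeping only the k largest singular values (matrix and k are illustrative):

```python
import numpy as np

def rank_k_approximation(A, k):
    """Best rank-k approximation A_k of A (in the Frobenius-norm sense)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Each document (column of A_k) can equivalently be represented by the
# k-dimensional column of diag(s[:k]) @ Vt[:k, :].
```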
Goals of LSI
LSI takes documents that are semantically similar
(= talk about the same topics), but are not similar
in the vector space (because they use different
words) and re-represents them in a reduced
vector space in which they have higher similarity.
LSA Example
A simple example term-document matrix
(binary)
LSA Example
Example of C = UΣVᵀ: the matrix U
Example of C = UΣVᵀ: the matrix Σ
Example of C = UΣVᵀ: the matrix Vᵀ
Simplistic picture [figure]: Topic 1, Topic 2, Topic 3
Data Fusion
Outline
What is data fusion?
Why use data fusion?
Previous work
Components of data fusion
System selection
Bias concept
Data fusion methods
Experiments
Conclusion
Data Fusion
Merging the retrieval results of multiple
systems.
A data fusion algorithm accepts two or more
ranked lists and merges these lists into a single
ranked list with the aim of providing better
effectiveness than all systems used for data
fusion.
Why use data fusion?
Combining evidence from different
systems leads to performance
improvement
Use data fusion to achieve better
performance than the individual
systems involved in the process.
Example metasearch systems
www.dogpile.com
www.copernic.com
Why use data fusion?
Same idea is also used for different query
representations
Fuse the results of different query
representations for the same request and
obtain better results
Measuring relative performance of IR systems
such as web search engines is essential
Use data fusion for finding pseudo relevant
documents and use these for automatic
ranking of retrieval systems
Components of data fusion
1. DB/search engine selector
Select systems to fuse
2. Query dispatcher
Submit queries to selected search engines
3. Document selector
Select documents to fuse
4. Result merger
Merge selected document results
Ranking retrieval systems
System selection methods
1. Best: certain percentage of top performing
systems used
2. Normal: all systems to be ranked are used
3. Bias: certain percentage of systems that
behave differently from the norm (majority
of all systems) are used
Calculating bias of a system
Similarity value (cosine similarity):
$s(v, w) = \frac{\sum_i v_i w_i}{\|v\|\,\|w\|}$
where w is a system's result vector and v is the vector of the norm (the sum of the vectors of all systems).
Bias of a system: $B(v, w) = 1 - s(v, w)$
Example of calculating bias
2 systems: A and B
7 documents: a, b, c, d, e, f, g
ith row is the result for ith query
XA=(3, 3, 3, 2, 1, 0, 0) XB=(0, 2, 3, 0, 2, 3, 2)
Norm vector: X = XA + XB = (3, 5, 6, 2, 3, 3, 2)
s(XA, X) = 49 / (32 · 96)^(1/2) = 0.8841
Bias(A) = 1 − 0.8841 = 0.1159
s(XB, X) = 47 / (30 · 96)^(1/2) = 0.8758
Bias(B) = 1 − 0.8758 = 0.1242
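A Python sketch of this bias computation, reproducing the numbers above:

```python
import math

def cosine(v, w):
    dot = sum(vi * wi for vi, wi in zip(v, w))
    return dot / (math.sqrt(sum(vi * vi for vi in v)) * math.sqrt(sum(wi * wi for wi in w)))

XA = (3, 3, 3, 2, 1, 0, 0)
XB = (0, 2, 3, 0, 2, 3, 2)
X = tuple(a + b for a, b in zip(XA, XB))     # the norm: sum over all systems

bias_A = 1 - cosine(X, XA)                   # ≈ 0.1159
bias_B = 1 - cosine(X, XB)                   # ≈ 0.1242
print(bias_A, bias_B)
```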
Bias calculation with order
Order is important because users usually just look
at the documents of higher rank.
2 systems: A and B
7 documents: a, b, c, d, e, f, g
ith row is the result for ith query
m=4
XA=(10, 8, 4, 2, 1, 0, 0); XB=(0, 8, 22/3, 0, 2, 8/3, 7/3)
Bias(A)=0.0087; Bias(B)=0.1226
Data fusion methods
1. Similarity value models
CombMIN, CombMAX, CombMED,
CombSUM, CombANZ, CombMNZ
Similarity value methods
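The details of these operators are not spelled out on the slide; as a hedged sketch, the standard Fox & Shaw definitions of two of them (CombSUM and CombMNZ) look like this, assuming each system reports a similarity score per document and missing documents contribute 0:

```python
def comb_sum(scores_per_system, doc):
    """CombSUM: sum of the similarity scores the systems give to the document."""
    return sum(s.get(doc, 0.0) for s in scores_per_system)

def comb_mnz(scores_per_system, doc):
    """CombMNZ: CombSUM multiplied by the number of systems that retrieved the document."""
    nz = sum(1 for s in scores_per_system if doc in s)
    return nz * comb_sum(scores_per_system, doc)

# Example: two systems' results as dicts {doc: score}
systems = [{"a": 0.9, "b": 0.4}, {"a": 0.7, "c": 0.5}]
fused = {d: comb_mnz(systems, d) for d in {"a", "b", "c"}}
ranking = sorted(fused, key=fused.get, reverse=True)   # highest fused score first
```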
Rank position method
Merge documents using only rank positions
Rank score of document i (j: system index):
$r(d_i) = \frac{1}{\sum_j 1/\mathrm{pos}(d_{ij})}$
If a system j has not ranked document i at all,
skip it.
Rank position example
4 systems: A, B, C, D
documents: a, b, c, d, e, f, g
Query results:
A={a,b,c,d}, B={a,d,b,e},
C={c,a,f,e}, D={b,g,e,f}
r(a) = 1/(1 + 1 + 1/2) = 0.4
r(b) = 1/(1/2 + 1/3 + 1) ≈ 0.55
Final ranking of documents:
(most relevant) a > b > c > d > e > f > g (least relevant)
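A Python sketch of the rank position method, reproducing this example (names are illustrative):

```python
def rank_position_fusion(result_lists):
    """result_lists: ranked lists of doc ids. Smaller r(d) = more relevant."""
    docs = {d for lst in result_lists for d in lst}
    scores = {}
    for d in docs:
        # sum 1/position over the systems that ranked d; systems that did not rank d are skipped
        total = sum(1.0 / (lst.index(d) + 1) for lst in result_lists if d in lst)
        scores[d] = 1.0 / total
    return sorted(docs, key=lambda d: scores[d])       # ascending r

results = [["a", "b", "c", "d"], ["a", "d", "b", "e"],
           ["c", "a", "f", "e"], ["b", "g", "e", "f"]]
print(rank_position_fusion(results))                    # 'a' first, 'b' second, as above
```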
Borda Count method
Based on democratic election strategies.
The highest ranked document in a system gets
n Borda points and each subsequent gets one
point less where n is the number of total
retrieved documents by all systems.
Borda Count example
3 systems: A, B, C
Query results:
A={a,c,b,d}, B={b,c,a,e}, C={c,a,b,e}
5 distinct docs retrieved: a, b, c, d, e. So, n=5.
BC(a)=BCA(a)+BCB(a)+BCC(a)=5+3+4=12
BC(b)=BCA(b)+BCB(b)+BCC(b)=3+5+3=11
Final ranking of documents:
(most relevant) c > a > b > e > d (least relevant)
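A Python sketch of Borda count fusion reproducing this example; documents a system does not retrieve simply get no points from it (an assumption here):

```python
def borda_fuse(result_lists):
    """Borda count fusion: the top doc in each list gets n points, the next n-1, ..."""
    docs = {d for lst in result_lists for d in lst}
    n = len(docs)                                   # total distinct docs retrieved
    points = {d: 0 for d in docs}
    for lst in result_lists:
        for rank, d in enumerate(lst):              # rank 0 = top of the list
            points[d] += n - rank
    return sorted(docs, key=lambda d: points[d], reverse=True)

results = [["a", "c", "b", "d"], ["b", "c", "a", "e"], ["c", "a", "b", "e"]]
print(borda_fuse(results))   # ['c', 'a', 'b', 'e', 'd'], as in the example
```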
Condorcet method
Also, based on democratic election strategies.
Majoritarian method
The winner is the document which beats each of the other documents in a pairwise comparison.
Condorcet example
3 candidate documents: a, b, c
5 systems: A, B, C, D, E
A: a>b>c - B:a>c>b - C:a>b=c - D:b>a - E:c>a
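A Python sketch of a Condorcet-style pairwise comparison over ranked lists; how ties and unranked documents are handled is an assumption here (a ranked document beats an unranked one):

```python
def condorcet_winner(rankings, docs):
    """rankings: list of dicts {doc: rank}, lower rank = better.
    Returns the doc that beats every other doc in pairwise comparisons, else None."""
    def prefer_count(x, y):
        # number of systems that rank x above y (assumption: ranked beats unranked)
        n = 0
        for r in rankings:
            rx, ry = r.get(x), r.get(y)
            if rx is not None and (ry is None or rx < ry):
                n += 1
        return n
    for d in docs:
        if all(prefer_count(d, o) > prefer_count(o, d) for o in docs if o != d):
            return d
    return None

# The five systems above (ties like b=c expressed as equal ranks):
rankings = [{"a": 1, "b": 2, "c": 3}, {"a": 1, "c": 2, "b": 3},
            {"a": 1, "b": 2, "c": 2}, {"b": 1, "a": 2}, {"c": 1, "a": 2}]
print(condorcet_winner(rankings, {"a", "b", "c"}))   # 'a' beats both 'b' and 'c' pairwise
```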
Experiments
First Approach
The mean average precision value of the merged system is significantly greater than that of all the individual systems
Second Approach
Find the data fusion method that gives the highest
mean average precision value
Experiments
Third Approach
Find the best stemming method in terms of mean
average precision values
Fourth Approach
See the effect of system selection methods
Conclusion
Data Fusion is an active research area
We will use several data fusion techniques on
the now famous Milliyet database and
compare their relative merits
We will also use TREC data for testing if
possible
We will hopefully find some novel approaches
in addition to existing methods