Sie sind auf Seite 1von 144

Carnegie Mellon

Powerful Tools for Data Mining


Fractals, power laws, SVD
C. Faloutsos
Carnegie Mellon University
Carnegie Mellon
MDIC-04 Copyright: C. Faloutsos (2004) 2
Part I:
Singular Value Decomposition (SVD)
= Eigenvalue decomposition

www.cs.cmu.edu/~christos/
Carnegie Mellon
MDIC-04 Copyright: C. Faloutsos (2004) 3
SVD - outline
Motivation
Definition - properties
Interpretation; more properties
Complexity
Applications - success stories
Conclusions
Carnegie Mellon
MDIC-04 Copyright: C. Faloutsos (2004) 4
SVD - Motivation
problem #1: text - LSI: find concepts
problem #2: compression / dim. reduction
Carnegie Mellon
MDIC-04 Copyright: C. Faloutsos (2004) 5
SVD - Motivation
problem #1: text - LSI: find concepts
Carnegie Mellon
MDIC-04 Copyright: C. Faloutsos (2004) 6
SVD - Motivation
problem #2: compress / reduce
dimensionality
Carnegie Mellon
MDIC-04 Copyright: C. Faloutsos (2004) 7
Problem - specs
~10**6 rows; ~10**3 columns; no updates;
Compress / extract features
Carnegie Mellon
MDIC-04 Copyright: C. Faloutsos (2004) 8
SVD - Motivation
Carnegie Mellon
MDIC-04 Copyright: C. Faloutsos (2004) 9
SVD - Motivation
Carnegie Mellon
MDIC-04 Copyright: C. Faloutsos (2004) 10
SVD - outline
Motivation
Definition - properties
Interpretation; more properties
Complexity
Applications - success stories
Conclusions
Carnegie Mellon
MDIC-04 Copyright: C. Faloutsos (2004) 11
SVD - Definition
(reminder: matrix multiplication

1 2
3 4
5 6
x
1
-1
3 x 2 2 x 1
=
Carnegie Mellon
MDIC-04 Copyright: C. Faloutsos (2004) 12
SVD - Definition
(reminder: matrix multiplication

1 2
3 4
5 6
x
1
-1
3 x 2 2 x 1
=
3 x 1
Carnegie Mellon
MDIC-04 Copyright: C. Faloutsos (2004) 13
SVD - Definition
(reminder: matrix multiplication

1 2
3 4
5 6
x
1
-1
3 x 2 2 x 1
=
-1
3 x 1
Carnegie Mellon
MDIC-04 Copyright: C. Faloutsos (2004) 14
SVD - Definition
(reminder: matrix multiplication

1 2
3 4
5 6
x
1
-1
3 x 2 2 x 1
=
-1
-1
3 x 1
Carnegie Mellon
MDIC-04 Copyright: C. Faloutsos (2004) 15
SVD - Definition
(reminder: matrix multiplication

1 2
3 4
5 6
x
1
-1
=
-1
-1
-1
Carnegie Mellon
MDIC-04 Copyright: C. Faloutsos (2004) 16
SVD - notation
Conventions:
bold capitals -> matrix (eg. A, U, L, V)
bold lower-case -> column vector (eg., x,
v
1
, u
3
)
regular lower-case -> scalars (eg., l
1
, l
r
)
Carnegie Mellon
MDIC-04 Copyright: C. Faloutsos (2004) 17
SVD - Definition
A
[n x m]
= U
[n x r]
L
[ r x r]
(V
[m x r]
)
T
A: n x m matrix (eg., n documents, m
terms)
U: n x r matrix (n documents, r concepts)
L: r x r diagonal matrix (strength of each
concept) (r : rank of the matrix)
V: m x r matrix (m terms, r concepts)
Carnegie Mellon
MDIC-04 Copyright: C. Faloutsos (2004) 18
SVD - Definition
A = U L V
T
- example:

Carnegie Mellon
MDIC-04 Copyright: C. Faloutsos (2004) 19
SVD - Properties
THEOREM [Press+92]: always possible to
decompose matrix A into A = U L V
T
,
where
U, L, V: unique (*)
U, V: column orthonormal (ie., columns are
unit vectors, orthogonal to each other)
U
T
U = I; V
T
V = I (I: identity matrix)
L: eigenvalues are positive, and sorted in
decreasing order
Carnegie Mellon
MDIC-04 Copyright: C. Faloutsos (2004) 20
SVD - Example
A = U L V
T
- example:

1 1 1 0 0
2 2 2 0 0
1 1 1 0 0
5 5 5 0 0
0 0 0 2 2
0 0 0 3 3
0 0 0 1 1
data
inf.
retrieval
brain
lung
0.18 0
0.36 0
0.18 0
0.90 0
0 0.53
0 0.80
0 0.27
=
CS
MD
9.64 0
0 5.29
x
0.58 0.58 0.58 0 0
0 0 0 0.71 0.71
x
Carnegie Mellon
MDIC-04 Copyright: C. Faloutsos (2004) 21
SVD - Example
A = U L V
T
- example:

1 1 1 0 0
2 2 2 0 0
1 1 1 0 0
5 5 5 0 0
0 0 0 2 2
0 0 0 3 3
0 0 0 1 1
data
inf.
retrieval
brain
lung
0.18 0
0.36 0
0.18 0
0.90 0
0 0.53
0 0.80
0 0.27
=
CS
MD
9.64 0
0 5.29
x
0.58 0.58 0.58 0 0
0 0 0 0.71 0.71
x
CS-concept
MD-concept
Carnegie Mellon
MDIC-04 Copyright: C. Faloutsos (2004) 22
SVD - Example
A = U L V
T
- example:

1 1 1 0 0
2 2 2 0 0
1 1 1 0 0
5 5 5 0 0
0 0 0 2 2
0 0 0 3 3
0 0 0 1 1
data
inf.
retrieval
brain
lung
0.18 0
0.36 0
0.18 0
0.90 0
0 0.53
0 0.80
0 0.27
=
CS
MD
9.64 0
0 5.29
x
0.58 0.58 0.58 0 0
0 0 0 0.71 0.71
x
CS-concept
MD-concept
doc-to-concept
similarity matrix
Carnegie Mellon
MDIC-04 Copyright: C. Faloutsos (2004) 23
SVD - Example
A = U L V
T
- example:

1 1 1 0 0
2 2 2 0 0
1 1 1 0 0
5 5 5 0 0
0 0 0 2 2
0 0 0 3 3
0 0 0 1 1
data
inf.
retrieval
brain
lung
0.18 0
0.36 0
0.18 0
0.90 0
0 0.53
0 0.80
0 0.27
=
CS
MD
9.64 0
0 5.29
x
0.58 0.58 0.58 0 0
0 0 0 0.71 0.71
x
strength of CS-concept
Carnegie Mellon
MDIC-04 Copyright: C. Faloutsos (2004) 24
SVD - Example
A = U L V
T
- example:

1 1 1 0 0
2 2 2 0 0
1 1 1 0 0
5 5 5 0 0
0 0 0 2 2
0 0 0 3 3
0 0 0 1 1
data
inf.
retrieval
brain
lung
0.18 0
0.36 0
0.18 0
0.90 0
0 0.53
0 0.80
0 0.27
=
CS
MD
9.64 0
0 5.29
x
0.58 0.58 0.58 0 0
0 0 0 0.71 0.71
x
term-to-concept
similarity matrix
CS-concept
Carnegie Mellon
MDIC-04 Copyright: C. Faloutsos (2004) 25
SVD - Example
A = U L V
T
- example:

1 1 1 0 0
2 2 2 0 0
1 1 1 0 0
5 5 5 0 0
0 0 0 2 2
0 0 0 3 3
0 0 0 1 1
data
inf.
retrieval
brain
lung
0.18 0
0.36 0
0.18 0
0.90 0
0 0.53
0 0.80
0 0.27
=
CS
MD
9.64 0
0 5.29
x
0.58 0.58 0.58 0 0
0 0 0 0.71 0.71
x
term-to-concept
similarity matrix
CS-concept
Carnegie Mellon
MDIC-04 Copyright: C. Faloutsos (2004) 26
SVD - outline
Motivation
Definition - properties
Interpretation; more properties
Complexity
Applications - success stories
Conclusions
Carnegie Mellon
MDIC-04 Copyright: C. Faloutsos (2004) 27
SVD - Interpretation #1
documents, terms and concepts:
[Foltz+92]
U: document-to-concept similarity matrix
V: term-to-concept sim. matrix
L: its diagonal elements: strength of each
concept

Carnegie Mellon
MDIC-04 Copyright: C. Faloutsos (2004) 28
SVD - Interpretation #1
documents, terms and concepts:
Q: if A is the document-to-term matrix, what
is A
T
A?

Q: A A
T
?
Carnegie Mellon
MDIC-04 Copyright: C. Faloutsos (2004) 29
SVD - Interpretation #1
documents, terms and concepts:
Q: if A is the document-to-term matrix, what
is A
T
A?
A: term-to-term ([m x m]) similarity matrix
Q: A A
T
?
A: document-to-document ([n x n]) similarity
matrix
Carnegie Mellon
MDIC-04 Copyright: C. Faloutsos (2004) 30
SVD - Interpretation #2
best axis to project on: (best = min sum of
squares of projection errors)
Carnegie Mellon
MDIC-04 Copyright: C. Faloutsos (2004) 31
SVD - Motivation
Carnegie Mellon
MDIC-04 Copyright: C. Faloutsos (2004) 32
SVD - interpretation #2
minimum RMS error
SVD: gives
best axis to project
v1
Carnegie Mellon
MDIC-04 Copyright: C. Faloutsos (2004) 33
SVD - Interpretation #2
Carnegie Mellon
MDIC-04 Copyright: C. Faloutsos (2004) 34
SVD - Interpretation #2
A = U L V
T
- example:

1 1 1 0 0
2 2 2 0 0
1 1 1 0 0
5 5 5 0 0
0 0 0 2 2
0 0 0 3 3
0 0 0 1 1
0.18 0
0.36 0
0.18 0
0.90 0
0 0.53
0 0.80
0 0.27
=
9.64 0
0 5.29
x
0.58 0.58 0.58 0 0
0 0 0 0.71 0.71
x
v1
Carnegie Mellon
MDIC-04 Copyright: C. Faloutsos (2004) 35
SVD - Interpretation #2
A = U L V
T
- example:

1 1 1 0 0
2 2 2 0 0
1 1 1 0 0
5 5 5 0 0
0 0 0 2 2
0 0 0 3 3
0 0 0 1 1
0.18 0
0.36 0
0.18 0
0.90 0
0 0.53
0 0.80
0 0.27
=
9.64 0
0 5.29
x
0.58 0.58 0.58 0 0
0 0 0 0.71 0.71
x
variance (spread) on the v1 axis
Carnegie Mellon
MDIC-04 Copyright: C. Faloutsos (2004) 36
SVD - Interpretation #2
A = U L V
T
- example:
U L gives the coordinates of the points in the
projection axis

1 1 1 0 0
2 2 2 0 0
1 1 1 0 0
5 5 5 0 0
0 0 0 2 2
0 0 0 3 3
0 0 0 1 1
0.18 0
0.36 0
0.18 0
0.90 0
0 0.53
0 0.80
0 0.27
=
9.64 0
0 5.29
x
0.58 0.58 0.58 0 0
0 0 0 0.71 0.71
x
Carnegie Mellon
MDIC-04 Copyright: C. Faloutsos (2004) 37
SVD - Interpretation #2
More details
Q: how exactly is dim. reduction done?

1 1 1 0 0
2 2 2 0 0
1 1 1 0 0
5 5 5 0 0
0 0 0 2 2
0 0 0 3 3
0 0 0 1 1
0.18 0
0.36 0
0.18 0
0.90 0
0 0.53
0 0.80
0 0.27
=
9.64 0
0 5.29
x
0.58 0.58 0.58 0 0
0 0 0 0.71 0.71
x
Carnegie Mellon
MDIC-04 Copyright: C. Faloutsos (2004) 38
SVD - Interpretation #2
More details
Q: how exactly is dim. reduction done?
A: set the smallest eigenvalues to zero:

1 1 1 0 0
2 2 2 0 0
1 1 1 0 0
5 5 5 0 0
0 0 0 2 2
0 0 0 3 3
0 0 0 1 1
0.18 0
0.36 0
0.18 0
0.90 0
0 0.53
0 0.80
0 0.27
=
9.64 0
0 5.29
x
0.58 0.58 0.58 0 0
0 0 0 0.71 0.71
x
Carnegie Mellon
MDIC-04 Copyright: C. Faloutsos (2004) 39
SVD - Interpretation #2

1 1 1 0 0
2 2 2 0 0
1 1 1 0 0
5 5 5 0 0
0 0 0 2 2
0 0 0 3 3
0 0 0 1 1
0.18 0
0.36 0
0.18 0
0.90 0
0 0.53
0 0.80
0 0.27
~
9.64 0
0 0
x
0.58 0.58 0.58 0 0
0 0 0 0.71 0.71
x
Carnegie Mellon
MDIC-04 Copyright: C. Faloutsos (2004) 40
SVD - Interpretation #2

1 1 1 0 0
2 2 2 0 0
1 1 1 0 0
5 5 5 0 0
0 0 0 2 2
0 0 0 3 3
0 0 0 1 1
0.18 0
0.36 0
0.18 0
0.90 0
0 0.53
0 0.80
0 0.27
~
9.64 0
0 0
x
0.58 0.58 0.58 0 0
0 0 0 0.71 0.71
x
Carnegie Mellon
MDIC-04 Copyright: C. Faloutsos (2004) 41
SVD - Interpretation #2

1 1 1 0 0
2 2 2 0 0
1 1 1 0 0
5 5 5 0 0
0 0 0 2 2
0 0 0 3 3
0 0 0 1 1
0.18
0.36
0.18
0.90
0
0
0
~
9.64
x
0.58 0.58 0.58 0 0
x
Carnegie Mellon
MDIC-04 Copyright: C. Faloutsos (2004) 42
SVD - Interpretation #2

1 1 1 0 0
2 2 2 0 0
1 1 1 0 0
5 5 5 0 0
0 0 0 2 2
0 0 0 3 3
0 0 0 1 1
~
1 1 1 0 0
2 2 2 0 0
1 1 1 0 0
5 5 5 0 0
0 0 0 0 0
0 0 0 0 0
0 0 0 0 0
Carnegie Mellon
MDIC-04 Copyright: C. Faloutsos (2004) 43
SVD - Interpretation #2
Exactly equivalent:
spectral decomposition of the matrix:
1 1 1 0 0
2 2 2 0 0
1 1 1 0 0
5 5 5 0 0
0 0 0 2 2
0 0 0 3 3
0 0 0 1 1
0.18 0
0.36 0
0.18 0
0.90 0
0 0.53
0 0.80
0 0.27
=
9.64 0
0 5.29
x
0.58 0.58 0.58 0 0
0 0 0 0.71 0.71
x
Carnegie Mellon
MDIC-04 Copyright: C. Faloutsos (2004) 44
SVD - Interpretation #2
Exactly equivalent:
spectral decomposition of the matrix:
1 1 1 0 0
2 2 2 0 0
1 1 1 0 0
5 5 5 0 0
0 0 0 2 2
0 0 0 3 3
0 0 0 1 1
=
x x
u
1
u
2

l
1

l
2

v
1

v
2

Carnegie Mellon
MDIC-04 Copyright: C. Faloutsos (2004) 45
SVD - Interpretation #2
Exactly equivalent:
spectral decomposition of the matrix:
1 1 1 0 0
2 2 2 0 0
1 1 1 0 0
5 5 5 0 0
0 0 0 2 2
0 0 0 3 3
0 0 0 1 1
= u
1
l
1
v
T
1
u
2
l
2
v
T
2
+
+...
n
m
Carnegie Mellon
MDIC-04 Copyright: C. Faloutsos (2004) 46
SVD - Interpretation #2
Exactly equivalent:
spectral decomposition of the matrix:
1 1 1 0 0
2 2 2 0 0
1 1 1 0 0
5 5 5 0 0
0 0 0 2 2
0 0 0 3 3
0 0 0 1 1
= u
1
l
1
v
T
1
u
2
l
2
v
T
2
+
+...
n
m
n x 1
1 x m
r terms
Carnegie Mellon
MDIC-04 Copyright: C. Faloutsos (2004) 47
SVD - Interpretation #2
approximation / dim. reduction:
by keeping the first few terms (Q: how many?)
1 1 1 0 0
2 2 2 0 0
1 1 1 0 0
5 5 5 0 0
0 0 0 2 2
0 0 0 3 3
0 0 0 1 1
= u
1
l
1
v
T
1
u
2
l
2
v
T
2
+
+...
n
m
assume: l
1
>= l
2
>= ...
Carnegie Mellon
MDIC-04 Copyright: C. Faloutsos (2004) 48
SVD - Interpretation #2
A (heuristic - [Fukunaga]): keep 80-90% of
energy (= sum of squares of l
i
s)
1 1 1 0 0
2 2 2 0 0
1 1 1 0 0
5 5 5 0 0
0 0 0 2 2
0 0 0 3 3
0 0 0 1 1
= u
1
l
1
v
T
1
u
2
l
2
v
T
2
+
+...
n
m
assume: l
1
>= l
2
>= ...
Carnegie Mellon
MDIC-04 Copyright: C. Faloutsos (2004) 49
SVD - outline
Motivation
Definition - properties
Interpretation
#1: documents/terms/concepts
#2: dim. reduction
#3: picking non-zero, rectangular blobs
#4: fixed points - graph interpretation
Complexity
...
Carnegie Mellon
MDIC-04 Copyright: C. Faloutsos (2004) 50
SVD - Interpretation #3
finds non-zero blobs in a data matrix
1 1 1 0 0
2 2 2 0 0
1 1 1 0 0
5 5 5 0 0
0 0 0 2 2
0 0 0 3 3
0 0 0 1 1
0.18 0
0.36 0
0.18 0
0.90 0
0 0.53
0 0.80
0 0.27
=
9.64 0
0 5.29
x
0.58 0.58 0.58 0 0
0 0 0 0.71 0.71
x
Carnegie Mellon
MDIC-04 Copyright: C. Faloutsos (2004) 51
SVD - Interpretation #3
finds non-zero blobs in a data matrix
1 1 1 0 0
2 2 2 0 0
1 1 1 0 0
5 5 5 0 0
0 0 0 2 2
0 0 0 3 3
0 0 0 1 1
0.18 0
0.36 0
0.18 0
0.90 0
0 0.53
0 0.80
0 0.27
=
9.64 0
0 5.29
x
0.58 0.58 0.58 0 0
0 0 0 0.71 0.71
x
Carnegie Mellon
MDIC-04 Copyright: C. Faloutsos (2004) 52
SVD - Interpretation #3
finds non-zero blobs in a data matrix
1 1 1 0 0
2 2 2 0 0
1 1 1 0 0
5 5 5 0 0
0 0 0 2 2
0 0 0 3 3
0 0 0 1 1
0.18 0
0.36 0
0.18 0
0.90 0
0 0.53
0 0.80
0 0.27
=
9.64 0
0 5.29
x
0.58 0.58 0.58 0 0
0 0 0 0.71 0.71
x
Carnegie Mellon
MDIC-04 Copyright: C. Faloutsos (2004) 53
SVD - Interpretation #3
finds non-zero blobs in a data matrix
1 1 1 0 0
2 2 2 0 0
1 1 1 0 0
5 5 5 0 0
0 0 0 2 2
0 0 0 3 3
0 0 0 1 1
0.18 0
0.36 0
0.18 0
0.90 0
0 0.53
0 0.80
0 0.27
=
9.64 0
0 5.29
x
0.58 0.58 0.58 0 0
0 0 0 0.71 0.71
x
Carnegie Mellon
MDIC-04 Copyright: C. Faloutsos (2004) 54
SVD - Interpretation #4
A as vector transformation
2 1
1 3
A
1
0
x
2
1
x
=
x
x
Carnegie Mellon
MDIC-04 Copyright: C. Faloutsos (2004) 55
SVD - Interpretation #4
if A is symmetric, then, by defn., its
eigenvectors remain parallel to themselves
(fixed points)
2 1
1 3
A
0.52
0.85
v
1
v
1

=
0.52
0.85
3.62 *
l
1

Carnegie Mellon
MDIC-04 Copyright: C. Faloutsos (2004) 56
SVD - Interpretation #4
fixed point operation (A
T
A is symmetric)
A
T
A v
i
= l
i
2
v
i
(i=1,.., r)

[Property 1]
eg., A: transition matrix of a Markov Chain;
then v
1
is related to the steady state probability
distribution (= a particle, doing a random walk
on the graph)
convergence: for (almost) any v:
(A
T
A)
k
v

~ (constant) v
1
[Property 2]
where v
1
: strongest right eigenvector
Carnegie Mellon
MDIC-04 Copyright: C. Faloutsos (2004) 57
SVD - Interpretation #4
convergence - illustration:
Carnegie Mellon
MDIC-04 Copyright: C. Faloutsos (2004) 58
SVD - Interpretation #4
convergence - illustration:
Carnegie Mellon
MDIC-04 Copyright: C. Faloutsos (2004) 59
SVD - Interpretation #4
v
1
... v
r
: right eigenvectors
symmetrically: u
1
... u
r
: left eigenvectors
for matrix A (= U L V
T
):
A v
i
= l
i
u
i

u
i

T
A = l
i
v
i

T
(right eigenvectors -> authority scores in
Kleinbergs algo;
left ones -> hub scores)
Carnegie Mellon
MDIC-04 Copyright: C. Faloutsos (2004) 60
SVD - outline
Motivation
Definition - properties
Interpretation
Complexity
Applications - success stories
Conclusions
Carnegie Mellon
MDIC-04 Copyright: C. Faloutsos (2004) 61
SVD - Complexity
O( n * m * m) or O( n * n * m) (whichever
is less)
less work, if we just want eigenvalues
... or if we want first k eigenvectors
... or if the matrix is sparse [Berry]
Implemented: in any linear algebra package
(LINPACK, matlab, Splus, mathematica ...)
Carnegie Mellon
MDIC-04 Copyright: C. Faloutsos (2004) 62
SVD - conclusions so far
SVD: A= U L V
T
: unique (*)
U: document-to-concept similarities
V: term-to-concept similarities
L: strength of each concept
Carnegie Mellon
MDIC-04 Copyright: C. Faloutsos (2004) 63
SVD - conclusions so far
dim. reduction: keep the first few strongest
eigenvalues (80-90% of energy)
SVD: picks up linear correlations
SVD: picks up non-zero blobs
v
1
: fixed point (-> steady-state prob.)
Carnegie Mellon
MDIC-04 Copyright: C. Faloutsos (2004) 64
SVD - outline
Motivation
Definition - properties
Interpretation
Complexity
Applications - success stories
Conclusions
Carnegie Mellon
MDIC-04 Copyright: C. Faloutsos (2004) 65
SVD - Case studies
...
Applications - success stories
multi-lingual IR; LSI queries
compression
PCA - ratio rules
Karhunen-Lowe transform
Kleinberg/google algorithm
Carnegie Mellon
MDIC-04 Copyright: C. Faloutsos (2004) 66
Case study - LSI
Q1: How to do queries with LSI?
Q2: multi-lingual IR (english query, on
spanish text?)


Carnegie Mellon
MDIC-04 Copyright: C. Faloutsos (2004) 67
Case study - LSI
Q1: How to do queries with LSI?
Problem: Eg., find documents with data

1 1 1 0 0
2 2 2 0 0
1 1 1 0 0
5 5 5 0 0
0 0 0 2 2
0 0 0 3 3
0 0 0 1 1
data
inf.
retrieval
brain
lung
0.18 0
0.36 0
0.18 0
0.90 0
0 0.53
0 0.80
0 0.27
=
CS
MD
9.64 0
0 5.29
x
0.58 0.58 0.58 0 0
0 0 0 0.71 0.71
x
Carnegie Mellon
MDIC-04 Copyright: C. Faloutsos (2004) 68
Case study - LSI
Q1: How to do queries with LSI?
A: map query vectors into concept space how?

1 1 1 0 0
2 2 2 0 0
1 1 1 0 0
5 5 5 0 0
0 0 0 2 2
0 0 0 3 3
0 0 0 1 1
data
inf.
retrieval
brain
lung
0.18 0
0.36 0
0.18 0
0.90 0
0 0.53
0 0.80
0 0.27
=
CS
MD
9.64 0
0 5.29
x
0.58 0.58 0.58 0 0
0 0 0 0.71 0.71
x
Carnegie Mellon
MDIC-04 Copyright: C. Faloutsos (2004) 69
Case study - LSI
Q1: How to do queries with LSI?
A: map query vectors into concept space how?

1 0 0 0 0

data
inf.
retrieval
brain
lung
q
T
=
term1
term2
v1
q
v2
Carnegie Mellon
MDIC-04 Copyright: C. Faloutsos (2004) 70
Case study - LSI
Q1: How to do queries with LSI?
A: map query vectors into concept space how?

1 0 0 0 0

data
inf.
retrieval
brain
lung
term1
term2
v1
q
v2
A: inner product
(cosine similarity)
with each concept vector v
i
q
T
=
Carnegie Mellon
MDIC-04 Copyright: C. Faloutsos (2004) 71
Case study - LSI
Q1: How to do queries with LSI?
A: map query vectors into concept space how?

1 0 0 0 0

data
inf.
retrieval
brain
lung
term1
term2
v1
q
v2
A: inner product
(cosine similarity)
with each concept vector v
i
q o v1
q
T
=
Carnegie Mellon
MDIC-04 Copyright: C. Faloutsos (2004) 72
Case study - LSI
compactly, we have:
q
T
concept
= q
T
V
Eg:
1 0 0 0 0

data
inf.
retrieval
brain
lung
q
T
=
0.58 0
0.58 0
0.58 0
0 0.71
0 0.71

term-to-concept
similarities
=
0.58 0

CS-concept
Carnegie Mellon
MDIC-04 Copyright: C. Faloutsos (2004) 73
Case study - LSI
Drill: how would the document (information,
retrieval) handled by LSI?
Carnegie Mellon
MDIC-04 Copyright: C. Faloutsos (2004) 74
Case study - LSI
Drill: how would the document (information,
retrieval) handled by LSI? A: SAME:
d
T
concept
= d
T
V
Eg:
0 1 1 0 0

data
inf.
retrieval
brain
lung
d
T
=
0.58 0
0.58 0
0.58 0
0 0.71
0 0.71

term-to-concept
similarities
=
1.16 0

CS-concept
Carnegie Mellon
MDIC-04 Copyright: C. Faloutsos (2004) 75
Case study - LSI
Observation: document (information, retrieval)
will be retrieved by query (data), although it
does not contain data!!
0 1 1 0 0

data
inf.
retrieval
brain
lung
d
T
=
1.16 0

CS-concept
1 0 0 0 0

0.58 0

q
T
=
Carnegie Mellon
MDIC-04 Copyright: C. Faloutsos (2004) 76
Case study - LSI
Q1: How to do queries with LSI?
Q2: multi-lingual IR (english query, on
spanish text?)


Carnegie Mellon
MDIC-04 Copyright: C. Faloutsos (2004) 77
Case study - LSI
Problem:
given many documents, translated to both
languages (eg., English and Spanish)
answer queries across languages

Carnegie Mellon
MDIC-04 Copyright: C. Faloutsos (2004) 78
Case study - LSI
Solution: ~ LSI
1 1 1 0 0
2 2 2 0 0
1 1 1 0 0
5 5 5 0 0
0 0 0 2 2
0 0 0 3 3
0 0 0 1 1
data
inf.
retrieval
brain
lung
CS
MD
1 1 1 0 0
1 2 2 0 0
1 1 1 0 0
5 5 4 0 0
0 0 0 2 2
0 0 0 2 3
0 0 0 1 1
datos
informacion
Carnegie Mellon
MDIC-04 Copyright: C. Faloutsos (2004) 79
Case study - LSI
Solution: ~ LSI
1 1 1 0 0
2 2 2 0 0
1 1 1 0 0
5 5 5 0 0
0 0 0 2 2
0 0 0 3 3
0 0 0 1 1
data
inf.
retrieval
brain
lung
CS
MD
1 1 1 0 0
1 2 2 0 0
1 1 1 0 0
5 5 4 0 0
0 0 0 2 2
0 0 0 2 3
0 0 0 1 1
datos
informacion
Carnegie Mellon
MDIC-04 Copyright: C. Faloutsos (2004) 80
SVD - outline - Case studies
...
Applications - success stories
multi-lingual IR; LSI queries
compression
PCA - ratio rules
Karhunen-Lowe transform
Kleinberg/google algorithm
Carnegie Mellon
MDIC-04 Copyright: C. Faloutsos (2004) 81
Case study: compression
[Korn+97]
Problem:
given a matrix
compress it, but maintain random access
Carnegie Mellon
MDIC-04 Copyright: C. Faloutsos (2004) 82
Problem - specs
~10**6 rows; ~10**3 columns; no updates;
random access to any cell(s) ; small error: OK
Carnegie Mellon
MDIC-04 Copyright: C. Faloutsos (2004) 83
Idea
Carnegie Mellon
MDIC-04 Copyright: C. Faloutsos (2004) 84
SVD - reminder
space savings: 2:1
minimum RMS error


Carnegie Mellon
MDIC-04 Copyright: C. Faloutsos (2004) 85
Performance - scaleup
space
error
Carnegie Mellon
MDIC-04 Copyright: C. Faloutsos (2004) 86
SVD - outline - Case studies
...
Applications - success stories
multi-lingual IR; LSI queries
compression
PCA - ratio rules
Karhunen-Lowe transform
Kleinberg/google algorithm
Carnegie Mellon
MDIC-04 Copyright: C. Faloutsos (2004) 87
PCA - Ratio Rules
[Korn+00]
Typically: Association Rules (eg.,
{bread, milk} -> {butter}
But:
which set of rules is better?
how to reconstruct missing/corrupted values?
Carnegie Mellon
MDIC-04 Copyright: C. Faloutsos (2004) 88
PCA - Ratio Rules
Idea: try to find concepts:
Eigenvectors dictate rules about ratios:
bread:milk:butter = 2:4:3
$ on bread
$ on milk
$ on butter
2
3
4
Carnegie Mellon
MDIC-04 Copyright: C. Faloutsos (2004) 89
PCA - Ratio Rules
[Korn+00]
Typically: Association Rules (eg.,
{bread, milk} -> {butter}
But:
how to reconstruct missing/corrupted values?
which set of rules is better?
Carnegie Mellon
MDIC-04 Copyright: C. Faloutsos (2004) 90
PCA - Ratio Rules
Moreover: visual data mining, for free:
Carnegie Mellon
MDIC-04 Copyright: C. Faloutsos (2004) 91
PCA - Ratio Rules
no Gaussian clusters; Zipf-like distribution
phonecalls
stocks
Carnegie Mellon
MDIC-04 Copyright: C. Faloutsos (2004) 92
PCA - Ratio Rules
NBA dataset
~500 players;
~30 attributes
Carnegie Mellon
MDIC-04 Copyright: C. Faloutsos (2004) 93
PCA - Ratio Rules
PCA: get eigenvectors v1, v2, ...
ignore entries with small abs. value
try to interpret the rest
Carnegie Mellon
MDIC-04 Copyright: C. Faloutsos (2004) 94
PCA - Ratio Rules
NBA dataset - V matrix (term to concept similarities)
v1
Carnegie Mellon
MDIC-04 Copyright: C. Faloutsos (2004) 95
Ratio Rules - example
RR1: minutes:points = 2:1
corresponding concept?
v1
Carnegie Mellon
MDIC-04 Copyright: C. Faloutsos (2004) 96
Ratio Rules - example
RR1: minutes:points = 2:1
corresponding concept?
A: goodness of player
Carnegie Mellon
MDIC-04 Copyright: C. Faloutsos (2004) 97
Ratio Rules - example
RR2: points: rebounds negatively
correlated(!)
Carnegie Mellon
MDIC-04 Copyright: C. Faloutsos (2004) 98
Ratio Rules - example
RR2: points: rebounds negatively
correlated(!) - concept?
v2
Carnegie Mellon
MDIC-04 Copyright: C. Faloutsos (2004) 99
Ratio Rules - example
RR2: points: rebounds negatively
correlated(!) - concept?
A: position: offensive/defensive
Carnegie Mellon
MDIC-04 Copyright: C. Faloutsos (2004) 100
SVD - Case studies
...
Applications - success stories
multi-lingual IR; LSI queries
compression
PCA - ratio rules
Karhunen-Lowe transform
Kleinberg/google algorithm
Carnegie Mellon
MDIC-04 Copyright: C. Faloutsos (2004) 101
K-L transform
[Duda & Hart]; [Fukunaga]

A subtle point:
SVD will give vectors that
go through the origin
v1
Carnegie Mellon
MDIC-04 Copyright: C. Faloutsos (2004) 102
K-L transform
A subtle point:
SVD will give vectors that
go through the origin
Q: how to find v
1
?
v1
v
1

Carnegie Mellon
MDIC-04 Copyright: C. Faloutsos (2004) 103
K-L transform
A subtle point:
SVD will give vectors that
go through the origin
Q: how to find v
1
?

A: centered PCA, ie.,
move the origin to center
of gravity
v1
v
1

Carnegie Mellon
MDIC-04 Copyright: C. Faloutsos (2004) 104
K-L transform
A subtle point:
SVD will give vectors that
go through the origin
Q: how to find v
1
?

A: centered PCA, ie.,
move the origin to center
of gravity
and THEN do SVD
v
1

Carnegie Mellon
MDIC-04 Copyright: C. Faloutsos (2004) 105
SVD - outline
Motivation
Definition - properties
Interpretation
Complexity
Applications - success stories
Conclusions
Carnegie Mellon
MDIC-04 Copyright: C. Faloutsos (2004) 106
SVD - Case studies
...
Applications - success stories
multi-lingual IR; LSI queries
compression
PCA - ratio rules
Karhunen-Lowe transform
Kleinberg/google algorithm
Carnegie Mellon
MDIC-04 Copyright: C. Faloutsos (2004) 107
Kleinbergs algorithm
Problem dfn: given the web and a query
find the most authoritative web pages for
this query

Step 0: find all pages containing the query terms
Step 1: expand by one move forward and backward
Carnegie Mellon
MDIC-04 Copyright: C. Faloutsos (2004) 108
Kleinbergs algorithm
Step 1: expand by one move forward and
backward
Carnegie Mellon
MDIC-04 Copyright: C. Faloutsos (2004) 109
Kleinbergs algorithm
on the resulting graph, give high score (=
authorities) to nodes that many important
nodes point to
give high importance score (hubs) to
nodes that point to good authorities)
hubs authorities
Carnegie Mellon
MDIC-04 Copyright: C. Faloutsos (2004) 110
Kleinbergs algorithm
observations
recursive definition!
each node (say, i-th node) has both an
authoritativeness score a
i
and a hubness
score h
i
Carnegie Mellon
MDIC-04 Copyright: C. Faloutsos (2004) 111
Kleinbergs algorithm
Let A be the adjacency matrix:
the (i,j) entry is 1 if the edge from i to j exists
Let h and a be [n x 1] vectors with the
hubness and authoritativiness scores.
Then:
Carnegie Mellon
MDIC-04 Copyright: C. Faloutsos (2004) 112
Kleinbergs algorithm
Then:
a
i
= h
k
+ h
l
+ h
m
that is
a
i
= Sum (h
j
) over all j that
(j,i) edge exists
or
a = A
T
h
k
l
m
i
Carnegie Mellon
MDIC-04 Copyright: C. Faloutsos (2004) 113
Kleinbergs algorithm
symmetrically, for the hubness:
h
i
= a
n
+ a
p
+ a
q
that is
h
i
= Sum (q
j
) over all j that
(i,j) edge exists
or
h = A a
p
n
q
i
Carnegie Mellon
MDIC-04 Copyright: C. Faloutsos (2004) 114
Kleinbergs algorithm
In conclusion, we want vectors h and a such
that:
h = A a
a = A
T
h
That is:
a = A
T
A a
Carnegie Mellon
MDIC-04 Copyright: C. Faloutsos (2004) 115
Kleinbergs algorithm
a is a right- eigenvector of the adjacency
matrix A (by dfn!)

Starting from random a and iterating, well
eventually converge
(Q: to which of all the eigenvectors? why?)
Carnegie Mellon
MDIC-04 Copyright: C. Faloutsos (2004) 116
Kleinbergs algorithm
(Q: to which of all the eigenvectors? why?)
A: to the one of the strongest eigenvalue,
because of [Property 2]:
(A
T

A )

k
v ~ (constant) v
1


Carnegie Mellon
MDIC-04 Copyright: C. Faloutsos (2004) 117
Kleinbergs algorithm - results
Eg., for the query java:
0.328 www.gamelan.com
0.251 java.sun.com
0.190 www.digitalfocus.com (the java
developer)


Carnegie Mellon
MDIC-04 Copyright: C. Faloutsos (2004) 118
Kleinbergs algorithm -
discussion
authority score can be used to find similar
pages (how?)
closely related to citation analysis, social
networks / small world phenomena
Carnegie Mellon
MDIC-04 Copyright: C. Faloutsos (2004) 119
google/page-rank algorithm
closely related: imagine a particle randomly
moving along the edges (*)
compute its steady-state probabilities



(*) with occasional random jumps
Carnegie Mellon
MDIC-04 Copyright: C. Faloutsos (2004) 120
SVD - outline
Motivation
Definition - properties
Interpretation
Complexity
Applications - success stories
Conclusions
Carnegie Mellon
MDIC-04 Copyright: C. Faloutsos (2004) 121
SVD - conclusions
SVD: a valuable tool - two major
properties:
(A) it gives the optimal low-rank
approximation (= least squares best hyper-
plane)
it finds concepts = principal components =
features; (linear) correlations
dim. reduction; visualization
(solves over- and under-constraint problems)
Carnegie Mellon
MDIC-04 Copyright: C. Faloutsos (2004) 122
SVD - conclusions - contd
(B) eigenvectors: fixed points
A
T
A v
i
= l
i
v
i

steady-state prob. in Markov Chains / graphs
hubs/authorities
Carnegie Mellon
MDIC-04 Copyright: C. Faloutsos (2004) 123
Conclusions - in detail:
given a (document-term) matrix, it finds
concepts (LSI)
... and can reduce dimensionality (KL)
... and can find rules (PCA; RatioRules)
Carnegie Mellon
MDIC-04 Copyright: C. Faloutsos (2004) 124
Conclusions contd
... and can find fixed-points or steady-state
probabilities (Kleinberg / google/ Markov
Chains)
(... and can solve optimally over- and under-
constraint linear systems - see Appendix,
(Eq. C1) - Moore/Penrose pseudo-inverse)
Carnegie Mellon
MDIC-04 Copyright: C. Faloutsos (2004) 125
References
Berry, Michael: http://www.cs.utk.edu/~lsi/
Brin, S. and L. Page (1998). Anatomy of a Large-
Scale Hypertextual Web Search Engine. 7th Intl
World Wide Web Conf.
Chen, C. M. and N. Roussopoulos (May 1994).
Adaptive Selectivity Estimation Using Query
Feedback. Proc. of the ACM-SIGMOD ,
Minneapolis, MN.
Carnegie Mellon
MDIC-04 Copyright: C. Faloutsos (2004) 126
References
Duda, R. O. and P. E. Hart (1973). Pattern
Classification and Scene Analysis. New York,
Wiley.
Faloutsos, C. (1996). Searching Multimedia
Databases by Content, Kluwer Academic Inc.
Foltz, P. W. and S. T. Dumais (Dec. 1992).
"Personalized Information Delivery: An Analysis
of Information Filtering Methods." Comm. of
ACM (CACM) 35(12): 51-60.
Carnegie Mellon
MDIC-04 Copyright: C. Faloutsos (2004) 127
References
Fukunaga, K. (1990). Introduction to Statistical
Pattern Recognition, Academic Press.
Jolliffe, I. T. (1986). Principal Component
Analysis, Springer Verlag.
Kleinberg, J. (1998). Authoritative sources in a
hyperlinked environment. Proc. 9th ACM-SIAM
Symposium on Discrete Algorithms.
Carnegie Mellon
MDIC-04 Copyright: C. Faloutsos (2004) 128
References
Korn, F., H. V. Jagadish, et al. (May 13-15, 1997).
Efficiently Supporting Ad Hoc Queries in Large
Datasets of Time Sequences. ACM SIGMOD,
Tucson, AZ.
Korn, F., A. Labrinidis, et al. (1998). Ratio Rules:
A New Paradigm for Fast, Quantifiable Data
Mining. VLDB, New York, NY.
Carnegie Mellon
MDIC-04 Copyright: C. Faloutsos (2004) 129
References
Korn, F., A. Labrinidis, et al. (2000).
"Quantifiable Data Mining Using Ratio Rules."
VLDB Journal 8(3-4): 254-266.
Press, W. H., S. A. Teukolsky, et al. (1992).
Numerical Recipes in C, Cambridge University
Press.
Strang, G. (1980). Linear Algebra and Its
Applications, Academic Press.
Carnegie Mellon
MDIC-04 Copyright: C. Faloutsos (2004) 130
Appendix: SVD properties
(A): obvious
(B): less obvious
(C): least obvious (and most powerful!)
Carnegie Mellon
MDIC-04 Copyright: C. Faloutsos (2004) 131
Properties - by defn.:
A(0): A
[n x m]
= U
[ n x r ]
L
[ r x r ]
V
T

[ r x m]

A(1): U
T
[r x n]
U
[n x r ]
= I
[r x r ]
(identity matrix)
A(2): V
T
[r x n]
V
[n x r ]
= I
[r x r ]

A(3): L
k
= diag( l
1
k
, l
2
k
, ... l
r
k
) (k: ANY real
number)
A(4): A
T
= V L U
T


Carnegie Mellon
MDIC-04 Copyright: C. Faloutsos (2004) 132
Less obvious properties
A(0): A
[n x m]
= U
[ n x r ]
L
[ r x r ]
V
T

[ r x m]

B(1): A
[n x m]
(A
T
)
[m x n]
= ??

Carnegie Mellon
MDIC-04 Copyright: C. Faloutsos (2004) 133
Less obvious properties
A(0): A
[n x m]
= U
[ n x r ]
L
[ r x r ]
V
T

[ r x m]
B(1): A
[n x m]
(A
T
)
[m x n]
= U L
2
U
T
symmetric; Intuition?

Carnegie Mellon
MDIC-04 Copyright: C. Faloutsos (2004) 134
Less obvious properties
A(0): A
[n x m]
= U
[ n x r ]
L
[ r x r ]
V
T

[ r x m]
B(1): A
[n x m]
(A
T
)
[m x n]
= U L
2
U
T
symmetric; Intuition?
document-to-document similarity matrix
B(2): symmetrically, for V
(A
T
)
[m x n]
A
[n x m]
= V L
2
V
T

Intuition?

Carnegie Mellon
MDIC-04 Copyright: C. Faloutsos (2004) 135
Less obvious properties
A: term-to-term similarity matrix

B(3): ( (A
T
)
[m x n]
A
[n x m]
)

k
= V L
2k
V
T

and
B(4): (A
T

A )

k
~ v
1
l
1
2k
v
1
T
for k>>1
where
v
1
: [m x 1] first column (eigenvector) of V
l
1
: strongest eigenvalue

Carnegie Mellon
MDIC-04 Copyright: C. Faloutsos (2004) 136
Less obvious properties
B(4): (A
T

A )

k
~ v
1
l
1
2k
v
1
T
for k>>1
B(5): (A
T

A )

k
v ~ (constant) v
1

ie., for (almost) any v, it converges to a
vector parallel to v
1
Thus, useful to compute first
eigenvector/value (as well as the next ones,
too...)

Carnegie Mellon
MDIC-04 Copyright: C. Faloutsos (2004) 137
Less obvious properties -
repeated:
A(0): A
[n x m]
= U
[ n x r ]
L
[ r x r ]
V
T

[ r x m]

B(1): A
[n x m]
(A
T
)
[m x n]
= U L
2
U
T
B(2): (A
T
)
[m x n]
A
[n x m]
= V L
2
V
T

B(3): ( (A
T
)
[m x n]
A
[n x m]
)

k
= V L
2k
V
T
B(4): (A
T

A )

k
~ v
1
l
1
2k
v
1
T

B(5): (A
T

A )

k
v ~ (constant) v
1


Carnegie Mellon
MDIC-04 Copyright: C. Faloutsos (2004) 138
Least obvious properties
A(0): A
[n x m]
= U
[ n x r ]
L
[ r x r ]
V
T

[ r x m]

C(1): A
[n x m]
x
[m x 1]
= b
[n x 1]

let x
0
= V L
(-1)
U
T
b
if under-specified, x
0
gives shortest solution
if over-specified, it gives the solution with the
smallest least squares error
(see Num. Recipes, p. 62)
Carnegie Mellon
MDIC-04 Copyright: C. Faloutsos (2004) 139
Least obvious properties
Illustration: under-specified, eg
[1 2] [w z]
T
= 4 (ie, 1 w + 2 z = 4)
1
2 3 4
1
2
all possible solutions
x
0

w
z
shortest-length solution
Carnegie Mellon
MDIC-04 Copyright: C. Faloutsos (2004) 140
Least obvious properties
Illustration: over-specified, eg
[3 2]
T
[w] = [1 2]
T
(ie, 3 w = 1; 2 w = 2 )
1
2 3 4
1
2
reachable points (3w, 2w)
desirable point b
Carnegie Mellon
MDIC-04 Copyright: C. Faloutsos (2004) 141
Least obvious properties - contd
A(0): A
[n x m]
= U
[ n x r ]
L
[ r x r ]
V
T

[ r x m]

C(2): A
[n x m]
v
1 [m x 1]
= l
1
u
1

[n x 1]
where v
1
, u
1
the first (column) vectors of V,
U. (v
1
== right-eigenvector)
C(3): symmetrically: u
1
T
A = l
1
v
1
T
u
1
== left-eigenvector
Therefore:

Carnegie Mellon
MDIC-04 Copyright: C. Faloutsos (2004) 142
Least obvious properties - contd
A(0): A
[n x m]
= U
[ n x r ]
L
[ r x r ]
V
T

[ r x m]

C(4): A
T
A v
1
= l
1
2
v
1
(fixed point - the usual dfn of eigenvector for a
symmetric matrix)

Carnegie Mellon
MDIC-04 Copyright: C. Faloutsos (2004) 143
Least obvious properties -
altogether
A(0): A
[n x m]
= U
[ n x r ]
L
[ r x r ]
V
T

[ r x m]

C(1): A
[n x m]
x
[m x 1]
= b
[n x 1]

then, x
0
= V L
(-1)
U
T
b: shortest, actual or least-
squares solution
C(2): A
[n x m]
v
1 [m x 1]
= l
1
u
1

[n x 1]
C(3): u
1
T
A = l
1
v
1
T

C(4): A
T
A v
1
= l
1
2
v
1
Carnegie Mellon
MDIC-04 Copyright: C. Faloutsos (2004) 144
Properties - conclusions
A(0): A
[n x m]
= U
[ n x r ]
L
[ r x r ]
V
T

[ r x m]

B(5): (A
T

A )

k
v ~ (constant) v
1

C(1): A
[n x m]
x
[m x 1]
= b
[n x 1]

then, x
0
= V L
(-1)
U
T
b: shortest, actual or least-
squares solution
C(4): A
T
A v
1
= l
1
2
v
1

Das könnte Ihnen auch gefallen