
# CS246: Mining Massive Datasets

Jure Leskovec, Stanford University
http://cs246.stanford.edu

## Compress / reduce dimensionality

Example: a matrix whose rows are all multiples of [1 1 1 0 0] or of [0 0 0 1 1] is really 2-dimensional, since every row can be reconstructed by scaling one of those two vectors.

1/23/2013 Jure Leskovec, Stanford C246: Mining Massive Datasets

Assumption: the data lies on or near a low d-dimensional subspace. The axes of this subspace are an effective representation of the data.

## Why reduce dimensions?

- Discover hidden correlations/topics (e.g., not all words are useful)
- Remove redundant and noisy features
- Interpretation and visualization
- Easier storage and processing of the data


## SVD: A = U Σ Vᵀ

- A: input data matrix, m × n (e.g., m documents, n terms)
- U: left singular vectors, m × r matrix (m documents, r concepts)
- Σ: singular values, r × r diagonal matrix (strength of each concept; r is the rank of A)
- Vᵀ: right singular vectors, transpose of the n × r matrix V (n terms, r concepts)
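These shapes are easy to check in practice; a minimal sketch with numpy (an assumption of ours, not part of the lecture; note `np.linalg.svd` returns Vᵀ directly, and `full_matrices=False` gives the reduced factorization whose inner dimension is min(m, n) rather than the rank r):

```python
import numpy as np

# Toy data matrix: m = 4 rows (documents), n = 3 columns (terms)
A = np.array([[1., 2., 0.],
              [2., 4., 0.],
              [0., 0., 3.],
              [0., 0., 1.]])

# Reduced ("economy") SVD: A = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(A, full_matrices=False)

print(U.shape, s.shape, Vt.shape)            # (4, 3) (3,) (3, 3)
print(np.allclose(A, U @ np.diag(s) @ Vt))   # True: exact reconstruction
```

The singular values in `s` come back sorted in decreasing order, matching the convention on the following slides.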

## SVD as a sum of rank-1 matrices

A = σ₁ u₁ v₁ᵀ + σ₂ u₂ v₂ᵀ + …

where each σᵢ is a scalar (a singular value), and uᵢ and vᵢ are vectors (the columns of U and V).

## SVD: properties

It is always possible to decompose a real matrix A into A = U Σ Vᵀ, where:

- U, Σ, V: unique
- U, V: column orthonormal: UᵀU = I, VᵀV = I (I: identity matrix; columns are orthogonal unit vectors)
- Σ: diagonal; its entries (the singular values) are positive and sorted in decreasing order (σ₁ ≥ σ₂ ≥ … ≥ 0)

1/23/2013 Jure Leskovec, Stanford C246: Mining Massive Datasets

## A = U Σ Vᵀ: example, Users to Movies

A is a 7 × 5 ratings matrix: rows are users, columns are the movies Matrix, Alien, Serenity, Casablanca, Amelie. The first four users rate the sci-fi movies; the last three rate mostly the romance movies:

            Matrix  Alien  Serenity  Casablanca  Amelie
    SciFi      1      1       1          0          0
    SciFi      3      3       3          0          0
    SciFi      4      4       4          0          0
    SciFi      5      5       5          0          0
    Romnce     0      2       0          4          4
    Romnce     0      0       0          5          5
    Romnce     0      1       0          2          2

Its SVD, A = U Σ Vᵀ (entries rounded to two decimals; U's first column is not fully legible in the source and is filled in approximately):

    U =  0.13   0.02  -0.01        Σ =  12.4   0     0
         0.41   0.07  -0.03              0    9.5    0
         0.55   0.09  -0.04              0     0    1.3
         0.68   0.11  -0.05
         0.15  -0.59   0.65        Vᵀ =  0.56   0.59   0.56   0.09   0.09
         0.07  -0.73  -0.67              0.12  -0.02   0.12  -0.69  -0.69
         0.07  -0.29   0.32              0.40  -0.80   0.40   0.09   0.09

Reading the factors:

- The first column of U is the "SciFi-concept" and the second the "Romance-concept": U relates users to concepts.
- The first row of Vᵀ is the SciFi-concept over movies and the second the Romance-concept: V relates movies to concepts.
- The diagonal of Σ gives the strength of each concept: 12.4 is the strength of the SciFi-concept.

## SVD: movies, users and concepts

- U: user-to-concept similarity matrix
- V: movie-to-concept similarity matrix
- Σ: its diagonal elements give the strength of each concept
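The factorization in this example can be checked numerically; a minimal sketch with numpy (numpy may flip the sign of any uᵢ/vᵢ pair, so only the singular values are compared to the slide):

```python
import numpy as np

# The users-to-movies ratings matrix from the example
# (columns: Matrix, Alien, Serenity, Casablanca, Amelie)
A = np.array([[1, 1, 1, 0, 0],
              [3, 3, 3, 0, 0],
              [4, 4, 4, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 2, 0, 4, 4],
              [0, 0, 0, 5, 5],
              [0, 1, 0, 2, 2]], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Concept strengths: approximately 12.4 (SciFi), 9.5 (Romance), 1.3,
# plus two (numerically) zero values, since A has rank 3
print(s)
print(np.allclose(A, U @ np.diag(s) @ Vt))   # True
```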

## SVD gives the best axis to project on

"Best" = minimizes the sum of squares of the projection errors; in other words, minimum reconstruction error.

[Figure: users plotted by Movie 1 rating vs. Movie 2 rating; the first right singular vector v₁ points along the best projection axis.]

## A = U Σ Vᵀ: the geometric view

- V: the movie-to-concept matrix; its first column v₁ is the best axis to project the points on.
- U: the user-to-concept matrix.
- σ₁ measures the variance (spread) of the points along the v₁ axis.
- U Σ gives the coordinates of the points on the projection axes; its first column is the projection of the users onto the SciFi axis.

[Figure: users plotted by Movie 1 rating vs. Movie 2 rating, with the v₁ axis and the projections of the points onto it.]
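A quick numpy check of this view, on the ratings matrix from the example (a sketch of ours, not lecture code):

```python
import numpy as np

A = np.array([[1, 1, 1, 0, 0],
              [3, 3, 3, 0, 0],
              [4, 4, 4, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 2, 0, 4, 4],
              [0, 0, 0, 5, 5],
              [0, 1, 0, 2, 2]], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Coordinates of the users on the concept axes: U @ diag(s).
# This equals the projection A @ V, since A V = U Sigma V^T V = U Sigma.
coords = U * s                         # broadcasting scales column i of U by s[i]
print(np.allclose(coords, A @ Vt.T))   # True

# First column of coords = projection of each user onto the SciFi axis
print(coords[:, 0])
```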

## More details: how is the dimensionality reduction done?

Q: How exactly is the dimensionality reduction done?
A: Set the smallest singular values to zero.

In the example, dropping σ₃ = 1.3 leaves the rank-2 approximation B = U₂ Σ₂ V₂ᵀ, where U₂ keeps the first two columns of U, Σ₂ = diag(12.4, 9.5), and V₂ᵀ keeps the first two rows of Vᵀ. B is close to A entry by entry; e.g. one romance column of B comes out to approximately (0.01, -0.01, 0.01, 0.03, 4.11, 4.78, 2.01)ᵀ versus (0, 0, 0, 0, 4, 5, 2)ᵀ in A.

The error is measured in the Frobenius norm:

    ‖M‖F = √(Σᵢⱼ Mᵢⱼ²)

and here ‖A - B‖F = √(Σᵢⱼ (Aᵢⱼ - Bᵢⱼ)²) is small.
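Zeroing the smallest singular values can be sketched directly in numpy; the Frobenius error of the resulting rank-2 matrix B equals the dropped singular value:

```python
import numpy as np

A = np.array([[1, 1, 1, 0, 0],
              [3, 3, 3, 0, 0],
              [4, 4, 4, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 2, 0, 4, 4],
              [0, 0, 0, 5, 5],
              [0, 1, 0, 2, 2]], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep only the k largest singular values
k = 2
B = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]

err = np.linalg.norm(A - B, 'fro')   # ||A - B||_F
print(err, s[2])                     # the error equals the dropped sigma_3
```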

## B is the best approximation of A

Claim: Let A = U Σ Vᵀ with σ₁ ≥ σ₂ ≥ … ≥ 0 and rank(A) = r, and let B = U S Vᵀ, where S is a diagonal n × n matrix with sᵢ = σᵢ for i = 1…k and sᵢ = 0 otherwise. Then B is a best rank-k approximation to A: B is a solution to min_B ‖A - B‖F where rank(B) = k.

We will need two facts:

- ‖M‖F = √(Σᵢ qᵢᵢ²), where M = P Q R is the SVD of M (P column orthonormal, R row orthonormal, Q diagonal).
- U Σ Vᵀ - U S Vᵀ = U (Σ - S) Vᵀ

## Why B is the best approximation

A = U Σ Vᵀ, B = U S Vᵀ (σ₁ ≥ σ₂ ≥ … ≥ 0, rank(A) = r), with S a diagonal n × n matrix where sᵢ = σᵢ for i = 1…k, else sᵢ = 0. Then B is a solution to min_B ‖A - B‖F with rank(B) = k. Why?

    min ‖A - B‖F = min ‖Σ - S‖F = min √( (σ₁ - s₁)² + … + (σᵣ - sᵣ)² )

Here we used U Σ Vᵀ - U S Vᵀ = U (Σ - S) Vᵀ, together with the fact that the Frobenius norm of U (Σ - S) Vᵀ is the root of the sum of the squared diagonal entries of Σ - S. We want to choose the sᵢ to minimize Σᵢ (σᵢ - sᵢ)². The solution is to set sᵢ = σᵢ for i = 1…k and the other sᵢ = 0:

    = min √( (σ₁ - s₁)² + … + (σₖ - sₖ)² + σₖ₊₁² + … + σᵣ² ) = √( σₖ₊₁² + … + σᵣ² )

## Equivalent: spectral decomposition of the matrix

A = σ₁ u₁ v₁ᵀ + σ₂ u₂ v₂ᵀ + …

where A is m × n, each uᵢ is an m × 1 column vector, and each vᵢᵀ is a 1 × n row vector; keeping only the first k terms gives a rank-k approximation.

Assume σ₁ ≥ σ₂ ≥ σ₃ ≥ … ≥ 0. Why is setting the small σᵢ to 0 the right thing to do? The vectors uᵢ and vᵢ are unit length, so it is σᵢ that scales each term. Zeroing out the small σᵢ therefore introduces the least error.

## Q: How many σs to keep?

A: Rule of thumb: keep enough singular values to retain 80-90% of the "energy" Σᵢ σᵢ² (again assuming σ₁ ≥ σ₂ ≥ σ₃ ≥ …).
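The rule of thumb can be sketched as a small helper (`choose_k` is a name of ours, not from the lecture):

```python
import numpy as np

def choose_k(singular_values, energy=0.9):
    """Smallest k whose top-k singular values retain the given
    fraction of the total 'energy' (the sum of sigma_i^2)."""
    sq = np.asarray(singular_values, dtype=float) ** 2
    kept = np.cumsum(sq) / sq.sum()
    return int(np.searchsorted(kept, energy) + 1)

# Singular values from the users-to-movies example
s = np.array([12.4, 9.5, 1.3])
print(choose_k(s, 0.9))   # 2: the first two concepts hold ~99% of the energy
```

With a 90% energy threshold the weak third concept (σ₃ = 1.3) is dropped, matching the rank-2 reduction on the earlier slides.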

## To compute the SVD

- Cost: O(nm²) or O(n²m), whichever is less.
- But: less work is needed if we just want the singular values, or only the first k singular vectors, or if the matrix is sparse.
- Implemented in linear algebra packages: LINPACK, Matlab, SPlus, Mathematica, ...
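For sparse matrices, packages compute only the top-k factors rather than the full decomposition; a sketch using scipy (an assumption of ours; note `scipy.sparse.linalg.svds` returns the singular values in ascending order):

```python
import numpy as np
from scipy.sparse import random as sparse_random
from scipy.sparse.linalg import svds

# A larger sparse matrix: only 1% of the entries are non-zero
A = sparse_random(1000, 200, density=0.01, random_state=0)

# Top 5 singular triplets, far cheaper than a full dense O(nm^2) SVD
u, s, vt = svds(A, k=5)

print(u.shape, s.shape, vt.shape)   # (1000, 5) (5,) (5, 200)
```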

## SVD: summary so far

SVD: A = U Σ Vᵀ, unique:

- U: user-to-concept similarities
- V: movie-to-concept similarities
- Σ: strength of each concept

Dimensionality reduction: keep the few largest singular values (80-90% of the "energy"). SVD picks up linear correlations.

## SVD: relation to eigen-decomposition

SVD gives us A = U Σ Vᵀ. The eigen-decomposition of a symmetric matrix A is A = X Λ Xᵀ (U, V, X are orthonormal, e.g. UᵀU = I; Σ, Λ are diagonal).

What are AAᵀ and AᵀA?

    AAᵀ = U Σ Vᵀ (U Σ Vᵀ)ᵀ = U Σ Vᵀ (V Σᵀ Uᵀ) = U Σ Σᵀ Uᵀ
    AᵀA = V Σᵀ Uᵀ (U Σ Vᵀ) = V Σᵀ Σ Vᵀ

Both have the form X Λ Xᵀ: U holds the eigenvectors of AAᵀ, V holds the eigenvectors of AᵀA, and the eigenvalues are λᵢ = σᵢ².

1/24/2013 Jure Leskovec, Stanford C246: Mining Massive Datasets
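The relation λᵢ = σᵢ² is easy to verify numerically; a sketch on a random matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))

s = np.linalg.svd(A, compute_uv=False)    # singular values, descending
lam = np.linalg.eigvalsh(A.T @ A)[::-1]   # eigenvalues of A^T A, descending

print(np.allclose(lam, s ** 2))           # True: lambda_i = sigma_i^2
```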

## Case study: how to query?

Q: Find users that like "Matrix".
A: Map the query into the "concept space". How?

Movies: Matrix, Alien, Serenity, Casablanca, Amelie. Consider a new user q who rated only Matrix:

    q = [5, 0, 0, 0, 0]

Project q into the concept space by taking the inner product of q with each concept vector vᵢ (q·v₁ for the SciFi-concept, q·v₂ for the Romance-concept). Compactly: qconcept = q V. E.g., with the V from the example:

    qconcept = [5, 0, 0, 0, 0] V = [2.8, 0.6]

so q scores 2.8 on the SciFi-concept and 0.6 on the Romance-concept.

How would a user d that rated (Alien, Serenity) be handled? The same way: dconcept = d V. E.g.:

    d = [0, 4, 5, 0, 0]    →  dconcept = d V = [5.2, 0.4]
    q = [5, 0, 0, 0, 0]    →  qconcept = q V = [2.8, 0.6]

Observation: user d, who rated (Alien, Serenity), turns out to be similar to user q, who rated (Matrix), although d and q have zero ratings in common! Zero ratings in common, but in the concept space their similarity ≠ 0.
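The whole query pipeline can be sketched in numpy (the sign conventions of the computed V may differ from the slides, but cosine similarities in concept space are unaffected, since any sign flip applies to q and d alike):

```python
import numpy as np

A = np.array([[1, 1, 1, 0, 0],
              [3, 3, 3, 0, 0],
              [4, 4, 4, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 2, 0, 4, 4],
              [0, 0, 0, 5, 5],
              [0, 1, 0, 2, 2]], dtype=float)

_, _, Vt = np.linalg.svd(A, full_matrices=False)
V2 = Vt[:2].T                        # movie-to-concept matrix, top 2 concepts

q = np.array([5., 0., 0., 0., 0.])   # rated only 'Matrix'
d = np.array([0., 4., 5., 0., 0.])   # rated 'Alien' and 'Serenity'

q_c, d_c = q @ V2, d @ V2            # map both users into concept space

def cos(x, y):
    return x @ y / (np.linalg.norm(x) * np.linalg.norm(y))

print(cos(q, d))      # 0.0: zero ratings in common
print(cos(q_c, d_c))  # close to 1: very similar in concept space
```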

## SVD: pros and cons

Pros:

- Optimal low-rank approximation in terms of the Frobenius norm.

Cons:

- Interpretability problem: a singular vector specifies a linear combination of all input columns or rows.
- Lack of sparsity: singular vectors are dense!

Announcements:

- HW2 has been posted.
- The LSH Gradiance quiz has been posted; due 2013-01-30 23:59.
- Date for the alternate final: Tue 3/19, 6-9 PM.

## CUR decomposition

Frobenius norm: ‖X‖F = √(Σᵢⱼ Xᵢⱼ²)

Goal: express A as a product of matrices C, U, R so that ‖A - C·U·R‖F is small, with constraints on C and R: C is a sample of actual columns of A, R is a sample of actual rows of A, and U is the pseudo-inverse of the intersection of C and R.

## Theorem [Drineas et al.]

CUR in O(mn) time achieves

    ‖A - CUR‖F ≤ ‖A - Aₖ‖F + ε ‖A‖F

with probability at least 1 - δ, by picking O(k log(1/δ) / ε²) columns and O(k² log³(1/δ) / ε⁶) rows (Aₖ: the best rank-k approximation of A).

## Computing U

Let W be the intersection of the sampled columns C and rows R, and let the SVD of W be W = X Z Yᵀ. Then:

    U = W⁺ = Y Z⁺ Xᵀ

where Z⁺ contains the reciprocals of the non-zero singular values (Z⁺ᵢᵢ = 1/Zᵢᵢ); W⁺ is the pseudoinverse of W.

Why does the pseudoinverse work? If W = X Z Yᵀ, then W⁻¹ = Y Z⁻¹ Xᵀ: due to orthonormality, X⁻¹ = Xᵀ and Y⁻¹ = Yᵀ, and since Z is diagonal, Z⁻¹ = diag(1/Zᵢᵢ). Thus, if W is nonsingular, the pseudoinverse is the true inverse.
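A minimal sketch of W⁺ computed exactly this way, checked against numpy's built-in pseudoinverse (the 2 × 2 W here is an invented example, not from the slides):

```python
import numpy as np

# W: a stand-in for the intersection of sampled columns C and rows R
W = np.array([[1., 1.],
              [0., 2.]])

X, z, Yt = np.linalg.svd(W)          # W = X Z Y^T

# Invert only the non-zero singular values
z_plus = np.array([1.0 / v if v > 1e-10 else 0.0 for v in z])
W_plus = Yt.T @ np.diag(z_plus) @ X.T    # W+ = Y Z+ X^T

print(np.allclose(W_plus, np.linalg.pinv(W)))   # True
print(np.allclose(W_plus, np.linalg.inv(W)))    # True: W nonsingular, so W+ = W^-1
```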

## CUR: pros

- Easy interpretation: the basis vectors are actual columns and rows of A.
- Sparse basis: since the basis vectors are actual columns and rows, they are as sparse as the data (an actual column, unlike a dense singular vector).

## Duplicate columns and rows

If we want to get rid of the duplicates among the sampled columns/rows:

- Throw them away.
- Scale (multiply) the remaining columns/rows by the square root of the number of duplicates.

[Figure: A with duplicated columns Cd and rows Rd reduced to scaled, deduplicated Cs and Rs; then construct a small U.]

## SVD vs. CUR

- SVD: A = U Σ Vᵀ. A is huge but sparse; U and Vᵀ are big and dense.
- CUR: A = C U R. A is huge but sparse; C and R are big but sparse (actual columns/rows of A), and U is small and dense.

## Example: DBLP bibliographic data

- Author-to-conference big sparse matrix: Aᵢⱼ = number of papers published by author i at conference j.
- 428K authors (rows), 3659 conferences (columns); very sparse.

We want to reduce the dimensionality: How much time does it take? What is the reconstruction error? How much space do we need?

## Results

- Accuracy: 1 - relative sum of squared errors.
- Space ratio: #output matrix entries / #input matrix entries.
- CPU time.

[Figure: plots of accuracy, space ratio, and CPU time for the DBLP experiment.]

Sun, Faloutsos: Less is More: Compact Matrix Decomposition for Large Sparse Graphs, SDM '07.

## Beyond linear projections: Isomap

SVD is limited to linear projections: it gives a lower-dimensional linear projection that preserves Euclidean distances. Non-linear methods such as Isomap instead assume the data lies on a nonlinear low-dimensional curve, aka a manifold, and use the distance as measured along the manifold.

How?

- Build an adjacency graph over the data points.
- Geodesic distance is the graph distance.
- SVD/PCA the graph's pairwise distance matrix.
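The three steps above can be sketched end to end with numpy and scipy (a toy implementation under simplifying assumptions: small dense data, a symmetrized kNN graph, and classical MDS standing in for the PCA step; the function name `isomap` is ours):

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import shortest_path

def isomap(X, n_neighbors=5, n_components=1):
    """Toy Isomap: kNN adjacency graph -> geodesic (graph) distances
    -> classical MDS of the pairwise distance matrix."""
    n = len(X)
    D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    # 1) adjacency graph: connect each point to its nearest neighbors
    G = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(D[i])[1:n_neighbors + 1]
        G[i, nbrs] = D[i, nbrs]
    G = np.maximum(G, G.T)                       # symmetrize
    # 2) geodesic distance = shortest-path distance in the graph
    geo = shortest_path(csr_matrix(G), directed=False)
    # 3) classical MDS on the squared geodesic distances
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (geo ** 2) @ J
    w, v = np.linalg.eigh(B)
    idx = np.argsort(w)[::-1][:n_components]
    return v[:, idx] * np.sqrt(np.maximum(w[idx], 0.0))

# Points on a half circle: a 1-D manifold curled up in 2-D
t = np.linspace(0, np.pi, 40)
X = np.c_[np.cos(t), np.sin(t)]
emb = isomap(X, n_neighbors=3, n_components=1)

# The 1-D embedding recovers the ordering of the points along the curve
diffs = np.diff(emb[:, 0])
print(np.all(diffs > 0) or np.all(diffs < 0))   # True
```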

## References

- Drineas et al.: Fast Monte Carlo Algorithms for Matrices III: Computing a Compressed Approximate Matrix Decomposition, SIAM Journal on Computing, 2006.
- J. Sun, Y. Xie, H. Zhang, C. Faloutsos: Less is More: Compact Matrix Decomposition for Large Sparse Graphs, SDM 2007.
- P. Paschou, M. W. Mahoney, A. Javed, J. R. Kidd, A. J. Pakstis, S. Gu, K. K. Kidd, P. Drineas: Intra- and interpopulation genotype reconstruction from tagging SNPs, Genome Research, 17(1), 96-107, 2007.
- M. W. Mahoney, M. Maggioni, P. Drineas: Tensor-CUR Decompositions for Tensor-Based Data, Proc. 12th Annual SIGKDD, 327-336, 2006.