
An Introduction to Sparse Approximation

Anna C. Gilbert
Department of Mathematics
University of Michigan

Basic image/signal/data compression: transform coding

Approximate signals sparsely

Compress images, signals, data

accurately (mathematics)
concisely (statistics)
efficiently (algorithms)

Redundancy

If one orthonormal basis is good, surely two (or more) are better...

...especially for images

Dictionary

Definition
A dictionary D in ℝⁿ is a collection {φ_ℓ}_{ℓ=1}^d ⊂ ℝⁿ of unit-norm
vectors: ‖φ_ℓ‖₂ = 1 for all ℓ.

Elements are called atoms
If span{φ_ℓ} = ℝⁿ, the dictionary is complete

If {φ_ℓ} are linearly dependent, the dictionary is redundant

Matrix representation
Form a matrix

        Φ = [φ₁ φ₂ . . . φ_d]

so that

        Φc = Σ_ℓ c_ℓ φ_ℓ.

Examples: Fourier–Dirac
Φ = [F | I]

        φ_ℓ(t) = (1/√n) e^{2πiℓt/n},    ℓ = 1, 2, . . . , n
        φ_ℓ(t) = δ_{ℓ−n}(t),            ℓ = n + 1, n + 2, . . . , 2n
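As a concrete illustration (my sketch, not from the slides), the Fourier–Dirac dictionary Φ = [F | I] can be built in a few lines of numpy; the helper name fourier_dirac_dictionary is mine.

```python
import numpy as np

def fourier_dirac_dictionary(n):
    """Build the Fourier-Dirac dictionary Phi = [F | I] with unit-norm atoms."""
    t = np.arange(n)
    # Fourier atoms: phi_l(t) = exp(2*pi*i*l*t/n) / sqrt(n)
    F = np.exp(2j * np.pi * np.outer(t, np.arange(n)) / n) / np.sqrt(n)
    I = np.eye(n)                       # Dirac (spike) atoms
    return np.hstack([F, I])            # n x 2n redundant dictionary

Phi = fourier_dirac_dictionary(8)
print(Phi.shape)                                    # (8, 16): d = 2n atoms
print(np.allclose(np.linalg.norm(Phi, axis=0), 1))  # every atom is unit-norm
```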

Sparse Problems
Exact. Given a vector x ∈ ℝⁿ and a complete dictionary Φ,
solve

        min_c ‖c‖₀   s.t.   x = Φc

i.e., find a sparsest representation of x over Φ.


Error. Given ε ≥ 0, solve

        min_c ‖c‖₀   s.t.   ‖x − Φc‖₂ ≤ ε

i.e., find a sparsest approximation of x that achieves error ε.


Sparse. Given k ≥ 1, solve

        min_c ‖x − Φc‖₂   s.t.   ‖c‖₀ ≤ k

i.e., find the best approximation of x using k atoms.
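For intuition (my sketch, not part of the slides), the Sparse problem can be solved exactly on tiny instances by trying every support of size k; the combinatorial number of supports is exactly what makes the general problem hard.

```python
import numpy as np
from itertools import combinations

def sparse_brute_force(Phi, x, k):
    """Solve min ||x - Phi c||_2 s.t. ||c||_0 <= k by exhaustive search over supports."""
    n, d = Phi.shape
    best_err, best_c = np.inf, None
    for support in combinations(range(d), k):
        sub = Phi[:, support]
        coef, *_ = np.linalg.lstsq(sub, x, rcond=None)   # least squares on this support
        err = np.linalg.norm(x - sub @ coef)
        if err < best_err:
            best_err = err
            best_c = np.zeros(d, dtype=coef.dtype)
            best_c[list(support)] = coef
    return best_c, best_err
```

There are C(d, k) candidate supports, so this is only feasible for very small dictionaries.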

NP-hardness

Theorem
Given an arbitrary redundant dictionary and a signal x, it is
NP-hard to solve the sparse representation problem Sparse.
[Natarajan95,Davis97]

Corollary
Error is NP-hard as well.

Corollary
It is NP-hard to determine if the optimal error is zero for a given
sparsity level k.

Exact Cover by 3-sets: X3C


Definition
Given a finite universe U and a collection X of subsets X₁, X₂, . . . , X_d
s.t. |X_i| = 3 for each i, does X contain a disjoint collection of
subsets whose union is U?
A classic NP-hard problem.

Proposition
Any instance of X3C is reducible in polynomial time to Sparse.
(Figure: subsets X₁, X₂, X₃, . . . , X_N covering elements of the universe U)
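To make the reduction concrete, here is my sketch of the standard construction (not text from the slides): each 3-set becomes a unit-norm indicator atom, the target signal is the normalized all-ones vector on U, and an exact representation with k = |U|/3 atoms corresponds to an exact cover.

```python
import numpy as np

def x3c_to_sparse(universe, sets):
    """Encode an X3C instance as a dictionary, a signal, and a sparsity level (sketch)."""
    index = {u: i for i, u in enumerate(universe)}
    n = len(universe)
    Phi = np.zeros((n, len(sets)))
    for j, S in enumerate(sets):
        for u in S:
            Phi[index[u], j] = 1.0
    Phi /= np.linalg.norm(Phi, axis=0)   # unit-norm atoms (1/sqrt(3) on their support)
    x = np.ones(n) / np.sqrt(n)          # normalized indicator of the whole universe
    return Phi, x, n // 3                # ask for a k = |U|/3 sparse exact representation
```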

Bad news, Good news


Bad news
Given any polynomial time algorithm for Sparse, there is a
dictionary and a signal x such that the algorithm returns an
incorrect answer
Pessimistic: worst case
Cannot hope to approximate the solution, either
Good news
Natural dictionaries are far from arbitrary
Perhaps natural dictionaries admit polynomial time algorithms
Optimistic: rarely see the worst case
Leverage our intuition from orthogonal bases

Hardness depends on instance

Redundant dictionary Φ   | input signal x          |
-------------------------+-------------------------+------------------------------
arbitrary                | arbitrary               | NP-hard
fixed                    | fixed                   | depends on choice of Φ
                         |                         | (example: spikes and sines)
random (distribution?)   |                         | compressive sensing
                         | random (distribution?)  | random signal model

Sparse algorithms: exploit geometry

Orthogonal case: pull off atoms one at a time, taking dot
products in order of decreasing magnitude
Why is the orthogonal case easy?
inner products between atoms are small
it's easy to tell which atom is the best choice

When atoms are (nearly) parallel, we can't tell which one is best

Coherence
Definition
The coherence of a dictionary Φ is

        μ = max_{j≠ℓ} |⟨φ_j, φ_ℓ⟩|

Small coherence (good)        Large coherence (bad)

Large, incoherent dictionaries


Fourier–Dirac: d = 2n, μ = 1/√n

wavelet packets: d = n log n, μ = 1/√2

There are large dictionaries with coherence close to the lower
(Welch) bound; e.g., Kerdock codes, d = n², μ = 1/√n

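As a quick numerical check (my code, not from the slides), the coherence definition above is a one-liner over the Gram matrix, and applying it to the Fourier–Dirac dictionary sketched earlier recovers μ = 1/√n.

```python
import numpy as np

def coherence(Phi):
    """Coherence mu = max_{j != l} |<phi_j, phi_l>| of a dictionary Phi."""
    G = np.abs(Phi.conj().T @ Phi)   # pairwise inner-product magnitudes
    np.fill_diagonal(G, 0.0)         # ignore <phi_l, phi_l> = 1
    return G.max()

n = 64
Phi = fourier_dirac_dictionary(n)        # from the earlier sketch
print(coherence(Phi), 1 / np.sqrt(n))    # both are 0.125
```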

Greedy algorithms

Build approximation one step at a time...


...choose the best atom at each step

Orthogonal Matching Pursuit (OMP)                              [Mallat93, Davis97]

Input. Dictionary Φ, signal x, number of steps k
Output. Coefficient vector c with k nonzeros, Φc ≈ x
Initialize. Counter t = 1, c = 0
1. Greedy selection.

        ℓ_t = argmax_ℓ |⟨x − Φc, φ_ℓ⟩|

2. Update. Find c_{ℓ₁}, . . . , c_{ℓ_t} to solve

        min ‖x − Σ_s c_{ℓ_s} φ_{ℓ_s}‖₂

   new approximation a_t = Φc

3. Iterate. t ← t + 1, stop when t > k.
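The outline above maps almost line for line onto numpy. A minimal sketch (my code, not the authors'), with the step-2 least-squares update done by lstsq at every iteration:

```python
import numpy as np

def omp(Phi, x, k):
    """Orthogonal Matching Pursuit: greedily select k atoms of Phi to approximate x."""
    d = Phi.shape[1]
    c = np.zeros(d, dtype=Phi.dtype)
    support = []
    residual = x.copy()
    for _ in range(k):
        # 1. Greedy selection: atom most correlated with the current residual
        correlations = np.abs(Phi.conj().T @ residual)
        correlations[support] = 0                # do not reselect chosen atoms
        support.append(int(np.argmax(correlations)))
        # 2. Update: least-squares fit of x on the selected atoms
        sub = Phi[:, support]
        coef, *_ = np.linalg.lstsq(sub, x, rcond=None)
        c[:] = 0
        c[support] = coef
        residual = x - sub @ coef
    return c
```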

Many greedy algorithms with similar outline

Matching Pursuit: replace step 2. by

        c_{ℓ_t} ← c_{ℓ_t} + ⟨x − Φc, φ_{ℓ_t}⟩

Thresholding
Choose m atoms whose |⟨x, φ_ℓ⟩| are among the m largest

Alternate stopping rules:

        ‖x − Φc‖₂ ≤ ε
        max_ℓ |⟨x − Φc, φ_ℓ⟩| ≤ ε

Many other variations

Convergence of OMP
Theorem
Suppose Φ is a complete dictionary for ℝⁿ. For any vector x, the
residual after t steps of OMP satisfies

        ‖x − Φc‖₂ ≤ C/√t.
                                                        [DeVore-Temlyakov]

Even if x can be expressed sparsely, OMP may take n steps
before the residual is zero.
But, sometimes OMP correctly identifies sparse
representations.

Exact Recovery Condition and coherence


Theorem (ERC)
A sufficient condition for OMP to identify Λ after k steps is that

        max_{ℓ∉Λ} ‖Φ_Λ⁺ φ_ℓ‖₁ < 1

where A⁺ = (AᵀA)⁻¹Aᵀ.                                          [Tropp04]

Theorem
The ERC holds whenever k < ½(μ⁻¹ + 1). Therefore, OMP can
recover any sufficiently sparse signal. [Tropp04]

For most redundant dictionaries, k < ½(√n + 1).
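The ERC quantity is easy to evaluate numerically for a given dictionary and candidate support; a small sketch (mine), using the pseudoinverse exactly as in the theorem:

```python
import numpy as np

def erc_value(Phi, support):
    """Return max_{l not in support} ||pinv(Phi_support) @ phi_l||_1.

    The Exact Recovery Condition holds when this value is < 1.
    """
    d = Phi.shape[1]
    pinv = np.linalg.pinv(Phi[:, support])            # (A^T A)^{-1} A^T
    others = [l for l in range(d) if l not in set(support)]
    return max(np.linalg.norm(pinv @ Phi[:, l], 1) for l in others)
```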

Sparse representation with OMP


Suppose x has a k-sparse representation

        x = Σ_{ℓ∈Λ} b_ℓ φ_ℓ   where |Λ| = k

Sufficient to find Λ. When can OMP do so?


Define

        Φ_Λ = [φ_{ℓ₁} φ_{ℓ₂} . . . φ_{ℓ_k}]      (atoms with ℓ_s ∈ Λ)

and

        Ψ_Λ = [φ_{ℓ₁} φ_{ℓ₂} . . . φ_{ℓ_{N−k}}]  (atoms with ℓ_s ∉ Λ)

Define the greedy selection ratio

        ρ(r) = ‖Ψ_Λ* r‖_∞ / ‖Φ_Λ* r‖_∞
             = max_{ℓ∉Λ} |⟨r, φ_ℓ⟩| / max_{ℓ∈Λ} |⟨r, φ_ℓ⟩|
             = (max i.p. with bad atoms) / (max i.p. with good atoms)

OMP chooses a good atom iff ρ(r) < 1

Sparse
Theorem
Assume k ≤ (1/3)μ⁻¹. For any vector x, the approximation x̂ after k
steps of OMP satisfies

        ‖x − x̂‖₂ ≤ √(1 + 6k) ‖x − x_k‖₂

where x_k is the best k-term approximation to x.

[Tropp04]

Theorem

Assume 4k ≤ μ⁻¹. Two-phase greedy pursuit produces x̂ s.t.

        ‖x − x̂‖₂ ≤ 3 ‖x − x_k‖₂.

Assume k ≤ μ⁻¹. Two-phase greedy pursuit produces x̂ s.t.

        ‖x − x̂‖₂ ≤ √(1 + 2k²μ/(1 − 2kμ)²) ‖x − x_k‖₂.

[Gilbert, Strauss, Muthukrishnan, Tropp03]

Alternative algorithmic approach


Exact: non-convex optimization

        min ‖c‖₀   s.t.   x = Φc

Convex relaxation of the non-convex problem

        min ‖c‖₁   s.t.   x = Φc

Error: non-convex optimization

        argmin ‖c‖₀   s.t.   ‖x − Φc‖₂ ≤ ε

Convex relaxation of the non-convex problem

        argmin ‖c‖₁   s.t.   ‖x − Φc‖₂ ≤ ε.
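Splitting c = c⁺ − c⁻ with c⁺, c⁻ ≥ 0 turns the equality-constrained relaxation into the linear program mentioned on the next slide. A minimal sketch (my code, assuming scipy is available and Φ is real):

```python
import numpy as np
from scipy.optimize import linprog

def basis_pursuit(Phi, x):
    """Solve min ||c||_1 s.t. Phi c = x via the split c = c_plus - c_minus."""
    d = Phi.shape[1]
    cost = np.ones(2 * d)              # sum of c_plus and c_minus entries = ||c||_1
    A_eq = np.hstack([Phi, -Phi])      # Phi c_plus - Phi c_minus = x
    res = linprog(cost, A_eq=A_eq, b_eq=x, bounds=(0, None))
    return res.x[:d] - res.x[d:]
```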

Convex relaxation: algorithmic formulation

Well-studied algorithmic formulation
[Donoho, Donoho-Elad-Temlyakov, Tropp, and many others]

Optimization problem = linear program: linear objective
function (with variables c⁺, c⁻) and linear or quadratic
constraints
Still need an algorithm for solving the optimization problem
Hard part of analysis: showing the solution to the convex problem =
solution to the original problem

Exact Recovery Condition


Theorem (ERC)
A sufficient condition for BP to recover the sparsest representation
of x is that

        max_{ℓ∉Λ} ‖Φ_Λ⁺ φ_ℓ‖₁ < 1

where A⁺ = (AᵀA)⁻¹Aᵀ.                                          [Tropp04]

Theorem
The ERC holds whenever k < ½(μ⁻¹ + 1). Therefore, BP can
recover any sufficiently sparse signal. [Tropp04]

Alternate optimization formulations

Constrained minimization:

        argmin ‖c‖₁   s.t.   ‖x − Φc‖₂ ≤ ε.

Unconstrained minimization:

        minimize L(c; λ, x) = ½‖x − Φc‖₂² + λ‖c‖₁.

Many algorithms for ℓ₁-regularization
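One of the simplest such algorithms is iterative soft thresholding (ISTA); the sketch below (mine, not from the slides, assuming a real dictionary Φ) minimizes the unconstrained objective L(c; λ, x) by alternating a gradient step on the quadratic term with a soft-threshold step.

```python
import numpy as np

def ista(Phi, x, lam, n_iter=500):
    """Iterative soft thresholding for (1/2)||x - Phi c||_2^2 + lam ||c||_1."""
    step = 1.0 / np.linalg.norm(Phi, 2) ** 2       # 1 / Lipschitz constant of the gradient
    c = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        grad = Phi.T @ (Phi @ c - x)               # gradient of the quadratic term
        z = c - step * grad
        c = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)   # soft threshold
    return c
```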

Sparse approximation: Optimization vs. Greedy

Exact and Error amenable to convex relaxation and
convex optimization

Sparse not amenable to convex relaxation

        argmin ‖Φc − x‖₂   s.t.   ‖c‖₀ ≤ k

but appropriate for greedy algorithms

Connection between...

Sparse Approximation and Statistical Learning

Sparsity in statistical learning

        y = Xβ        y: responses,   X_j = (X_{1j}, . . . , X_{Nj})ᵀ: predictor variables

Goal: Given X and y, find α and coeffs. β ∈ ℝᵖ for the linear
model that minimizes the error

        (α̂, β̂) = argmin ‖(y − α) − Xβ‖₂².

Solution:
Least squares: low bias but large variance and hard to interpret
(lots of non-zero coefficients)
Shrink the β_j's, make β sparse.

Algorithms in statistical learning


Brute force: Calculate Mallows' Cp for every subset of
predictor variables, and choose the best one.
Greedy algorithms: Forward selection, forward stagewise,
least angle regression (LARS), backward elimination.
Constrained optimization: Quadratic programming problem
with linear constraints (e.g., LASSO); see the sketch below.
Unconstrained optimization: regularization techniques
Sparse approximation and SVM equivalence

[Girosi 96]
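As a concrete illustration of the LASSO item above (my example; it assumes scikit-learn is installed), the ℓ₁ penalty shrinks most coefficients exactly to zero:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
N, p = 100, 50
X = rng.standard_normal((N, p))
beta_true = np.zeros(p)
beta_true[:5] = [3.0, -2.0, 1.5, 1.0, -0.5]      # only 5 active predictors
y = X @ beta_true + 0.1 * rng.standard_normal(N)

model = Lasso(alpha=0.1).fit(X, y)               # alpha plays the role of lambda
print(np.count_nonzero(model.coef_))             # few nonzero coefficients remain
```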

Connection between...

Sparse Approximation and Compressed Sensing


        y = Φx        x: data/image,   y: measurements/coeffs.

Interchange the roles of signal and coefficients.

Problem statement (TCS Perspective)

Assume x has low complexity: x is k-sparse (with noise)

Construct
  a matrix Φ: ℝⁿ → ℝᵐ, with m as small as possible
  a decoding algorithm D
Given Φx for any signal x ∈ ℝⁿ, we can, with high probability,
quickly recover x̂ with

        ‖x − x̂‖_p ≤ (1 + ε) min_{y k-sparse} ‖x − y‖_q = (1 + ε) ‖x − x_k‖_q
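A toy end-to-end sketch of this setup (my code; a Gaussian measurement matrix is one common choice, and the omp function from the earlier sketch serves as the decoder):

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, k = 256, 80, 5
x = np.zeros(n)
x[rng.choice(n, k, replace=False)] = rng.standard_normal(k)   # k-sparse signal

Phi = rng.standard_normal((m, n)) / np.sqrt(m)   # random measurement matrix, m << n
y = Phi @ x                                      # m linear measurements

x_hat = omp(Phi, y, k)                           # decode with the earlier OMP sketch
print(np.linalg.norm(x - x_hat))                 # typically near zero for these sizes
```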

Comparison with Sparse Approximation

Sparse: Given y and Φ, find (sparse) x such that y = Φx.
Return x̂ with the guarantee that

        ‖Φx̂ − y‖₂

is small compared with ‖y − Φx_k‖₂.

CS: Given y and Φ, find (sparse) x such that y = Φx. Return
x̂ with the guarantee that

        ‖x̂ − x‖_p

is small compared with ‖x − x_k‖_q.

p and q not always the same, not always = 2.

Comparison with Statistical Learning

        y = Xβ        y ∈ ℝᴺ: responses,   X_j = (X_{1j}, . . . , X_{Nj}): predictor variables

Goal: Given X and y, find α and coeffs. β ∈ ℝᵖ for the linear
model that minimizes the error

        (α̂, β̂) = argmin ‖(y − α) − Xβ‖₂².

Statistics: X drawn iid from a distribution (i.e., cheap
generation); characterize mathematical performance as a
function of the distribution
TCS: X (or the distribution) is constructed (i.e., expensive);
characterize algorithmic performance as a function of space,
time, and randomness

Analogy: root-finding

(Figure: a function f with root p; one approximation p̂ comes with the guarantee
|p̂ − p| small, the other with the guarantee |f(p̂) − 0| small)

Sparse: Given f (and y = 0), find p such that f (p) = 0.
Return p̂ with the guarantee that

        |f (p̂) − 0| is small.

CS: Given f (and y = 0), find p such that f (p) = 0. Return
p̂ with the guarantee that

        |p̂ − p| is small.

Parameters

1. Number of measurements m
2. Recovery time
3. Approximation guarantee (norms, mixed)
4. One matrix vs. distribution over matrices
5. Explicit construction
6. Universal matrix (for any basis, after measuring)
7. Tolerance to measurement noise

Applications

* Data stream algorithms
    x_i = number of items with index i
    can maintain Ax under increments to x and recover an approximation to x

* Efficient data sensing
    digital/analog cameras
    analog-to-digital converters
    high throughput biological screening (pooling designs)

* Error-correcting codes
    code = {y ∈ ℝⁿ | Ay = 0}
    x = error vector, Ax = syndrome

Two approaches
Geometric

[Donoho 04], [Candes-Tao 04, 06], [Candes-Romberg-Tao 05],
[Rudelson-Vershynin 06], [Cohen-Dahmen-DeVore 06], and many others...

Dense recovery matrices that satisfy RIP (e.g., Gaussian,
Fourier)
Geometric recovery methods (ℓ₁ minimization, LP)

        x̂ = argmin ‖z‖₁   s.t.   Φz = Φx

Uniform guarantee: one matrix A that works for all x

Combinatorial

[Gilbert-Guha-Indyk-Kotidis-Muthukrishnan-Strauss 02],
[Charikar-Chen-FarachColton 02], [Cormode-Muthukrishnan 04],
[Gilbert-Strauss-Tropp-Vershynin 06, 07]

Sparse random matrices (typically)

Combinatorial recovery methods or weak, greedy algorithms
Per-instance guarantees, later uniform guarantees

Summary

* Sparse approximation, statistical learning, and compressive
sensing are intimately related
* Many models of computation and scientific/technological
problems in which they all arise
* Algorithms for all are similar: optimization and greedy
* Community progress on geometric and statistical models for
matrices Φ and signals x, and different problem instance types
* Explicit constructions?
* Better/different geometric/statistical models?
* Better connections with coding and complexity theory?
