
An Introduction to Sparse Approximation

Anna C. Gilbert
Department of Mathematics
University of Michigan

Basic image/signal/data compression: transform coding

Approximate signals sparsely

Compress images, signals, data

accurately (mathematics)
concisely (statistics)
efficiently (algorithms)

Redundancy

If one orthonormal basis is good, surely two (or more) are better...

...especially for images

Dictionary

Definition
A dictionary D in ℝⁿ is a collection {φ_ℓ}_{ℓ=1}^d ⊂ ℝⁿ of unit-norm
vectors: ‖φ_ℓ‖₂ = 1 for all ℓ.

Elements are called atoms
If span{φ_ℓ} = ℝⁿ, the dictionary is complete

If {φ_ℓ} are linearly dependent, the dictionary is redundant

Matrix representation
Form a matrix

        Φ = [φ₁ φ₂ . . . φ_d]

so that

        Φc = Σ_ℓ c_ℓ φ_ℓ.

Examples: Fourier–Dirac
Φ = [F | I]

        φ_ℓ(t) = (1/√n) e^{2πiℓt/n},    ℓ = 1, 2, . . . , n
        φ_ℓ(t) = δ_{ℓ−n}(t),            ℓ = n + 1, n + 2, . . . , 2n
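As a concrete illustration (my sketch, not from the slides), the Fourier–Dirac dictionary Φ = [F | I] can be built in a few lines of numpy; the helper name fourier_dirac_dictionary is mine.

```python
import numpy as np

def fourier_dirac_dictionary(n):
    """Build the Fourier-Dirac dictionary Phi = [F | I] with unit-norm atoms."""
    t = np.arange(n)
    # Fourier atoms: phi_l(t) = exp(2*pi*i*l*t/n) / sqrt(n)
    F = np.exp(2j * np.pi * np.outer(t, np.arange(n)) / n) / np.sqrt(n)
    I = np.eye(n)                       # Dirac (spike) atoms
    return np.hstack([F, I])            # n x 2n redundant dictionary

Phi = fourier_dirac_dictionary(8)
print(Phi.shape)                                    # (8, 16): d = 2n atoms
print(np.allclose(np.linalg.norm(Phi, axis=0), 1))  # every atom is unit-norm
```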

Sparse Problems
Exact. Given a vector x ∈ ℝⁿ and a complete dictionary Φ,
solve

        min_c ‖c‖₀   s.t.   x = Φc

i.e., find a sparsest representation of x over Φ.


Error. Given ε ≥ 0, solve

        min_c ‖c‖₀   s.t.   ‖x − Φc‖₂ ≤ ε

i.e., find a sparsest approximation of x that achieves error ε.


Sparse. Given k ≥ 1, solve

        min_c ‖x − Φc‖₂   s.t.   ‖c‖₀ ≤ k

i.e., find the best approximation of x using k atoms.
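For intuition (my sketch, not part of the slides), the Sparse problem can be solved exactly on tiny instances by trying every support of size k; the combinatorial number of supports is exactly what makes the general problem hard.

```python
import numpy as np
from itertools import combinations

def sparse_brute_force(Phi, x, k):
    """Solve min ||x - Phi c||_2 s.t. ||c||_0 <= k by exhaustive search over supports."""
    n, d = Phi.shape
    best_err, best_c = np.inf, None
    for support in combinations(range(d), k):
        sub = Phi[:, support]
        coef, *_ = np.linalg.lstsq(sub, x, rcond=None)   # least squares on this support
        err = np.linalg.norm(x - sub @ coef)
        if err < best_err:
            best_err = err
            best_c = np.zeros(d, dtype=coef.dtype)
            best_c[list(support)] = coef
    return best_c, best_err
```

There are C(d, k) candidate supports, so this is only feasible for very small dictionaries.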

NP-hardness

Theorem
Given an arbitrary redundant dictionary and a signal x, it is
NP-hard to solve the sparse representation problem Sparse.
[Natarajan95,Davis97]

Corollary
Error is NP-hard as well.

Corollary
It is NP-hard to determine if the optimal error is zero for a given
sparsity level k.

Exact Cover by 3-sets: X3C


Definition
Given a finite universe U and a collection X of subsets X₁, X₂, . . . , X_d
s.t. |X_i| = 3 for each i, does X contain a disjoint collection of
subsets whose union is U?
A classic NP-hard problem.

Proposition
Any instance of X3C is reducible in polynomial time to Sparse.
(Figure: subsets X₁, X₂, X₃, . . . , X_N covering elements of the universe U)
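To make the reduction concrete, here is my sketch of the standard construction (not text from the slides): each 3-set becomes a unit-norm indicator atom, the target signal is the normalized all-ones vector on U, and an exact representation with k = |U|/3 atoms corresponds to an exact cover.

```python
import numpy as np

def x3c_to_sparse(universe, sets):
    """Encode an X3C instance as a dictionary, a signal, and a sparsity level (sketch)."""
    index = {u: i for i, u in enumerate(universe)}
    n = len(universe)
    Phi = np.zeros((n, len(sets)))
    for j, S in enumerate(sets):
        for u in S:
            Phi[index[u], j] = 1.0
    Phi /= np.linalg.norm(Phi, axis=0)   # unit-norm atoms (1/sqrt(3) on their support)
    x = np.ones(n) / np.sqrt(n)          # normalized indicator of the whole universe
    return Phi, x, n // 3                # ask for a k = |U|/3 sparse exact representation
```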

Bad news, Good news


Bad news
Given any polynomial time algorithm for Sparse, there is a
dictionary and a signal x such that the algorithm returns an
incorrect answer
Pessimistic: worst case
Cannot hope to approximate the solution, either
Good news
Natural dictionaries are far from arbitrary
Perhaps natural dictionaries admit polynomial time algorithms
Optimistic: rarely see the worst case
Leverage our intuition from orthogonal bases

Hardness depends on instance

Redundant dictionary Φ   | input signal x          |
-------------------------+-------------------------+------------------------------
arbitrary                | arbitrary               | NP-hard
fixed                    | fixed                   | depends on choice of Φ
                         |                         | (example: spikes and sines)
random (distribution?)   |                         | compressive sensing
                         | random (distribution?)  | random signal model

Sparse algorithms: exploit geometry

Orthogonal case: pull off atoms one at a time, taking dot
products in order of decreasing magnitude
Why is the orthogonal case easy?
inner products between atoms are small
it's easy to tell which atom is the best choice

When atoms are (nearly) parallel, we can't tell which one is best

Coherence
Definition
The coherence of a dictionary Φ is

        μ = max_{j≠ℓ} |⟨φ_j, φ_ℓ⟩|

Small coherence (good)        Large coherence (bad)

Large, incoherent dictionaries


Fourier–Dirac: d = 2n, μ = 1/√n

wavelet packets: d = n log n, μ = 1/√2

There are large dictionaries with coherence close to the lower
(Welch) bound; e.g., Kerdock codes, d = n², μ = 1/√n

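As a quick numerical check (my code, not from the slides), the coherence definition above is a one-liner over the Gram matrix, and applying it to the Fourier–Dirac dictionary sketched earlier recovers μ = 1/√n.

```python
import numpy as np

def coherence(Phi):
    """Coherence mu = max_{j != l} |<phi_j, phi_l>| of a dictionary Phi."""
    G = np.abs(Phi.conj().T @ Phi)   # pairwise inner-product magnitudes
    np.fill_diagonal(G, 0.0)         # ignore <phi_l, phi_l> = 1
    return G.max()

n = 64
Phi = fourier_dirac_dictionary(n)        # from the earlier sketch
print(coherence(Phi), 1 / np.sqrt(n))    # both are 0.125
```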

Greedy algorithms

Build approximation one step at a time...


...choose the best atom at each step

Orthogonal Matching Pursuit (OMP)                              [Mallat93, Davis97]

Input. Dictionary Φ, signal x, number of steps k
Output. Coefficient vector c with k nonzeros, Φc ≈ x
Initialize. Counter t = 1, c = 0
1. Greedy selection.

        ℓ_t = argmax_ℓ |⟨x − Φc, φ_ℓ⟩|

2. Update. Find c_{ℓ₁}, . . . , c_{ℓ_t} to solve

        min ‖x − Σ_s c_{ℓ_s} φ_{ℓ_s}‖₂

   new approximation a_t = Φc

3. Iterate. t ← t + 1, stop when t > k.
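The outline above maps almost line for line onto numpy. A minimal sketch (my code, not the authors'), with the step-2 least-squares update done by lstsq at every iteration:

```python
import numpy as np

def omp(Phi, x, k):
    """Orthogonal Matching Pursuit: greedily select k atoms of Phi to approximate x."""
    d = Phi.shape[1]
    c = np.zeros(d, dtype=Phi.dtype)
    support = []
    residual = x.copy()
    for _ in range(k):
        # 1. Greedy selection: atom most correlated with the current residual
        correlations = np.abs(Phi.conj().T @ residual)
        correlations[support] = 0                # do not reselect chosen atoms
        support.append(int(np.argmax(correlations)))
        # 2. Update: least-squares fit of x on the selected atoms
        sub = Phi[:, support]
        coef, *_ = np.linalg.lstsq(sub, x, rcond=None)
        c[:] = 0
        c[support] = coef
        residual = x - sub @ coef
    return c
```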

Many greedy algorithms with similar outline

Matching Pursuit: replace step 2. by

        c_{ℓ_t} ← c_{ℓ_t} + ⟨x − Φc, φ_{ℓ_t}⟩

Thresholding
Choose m atoms whose |⟨x, φ_ℓ⟩| are among the m largest

Alternate stopping rules:

        ‖x − Φc‖₂ ≤ ε
        max_ℓ |⟨x − Φc, φ_ℓ⟩| ≤ ε

Many other variations

Convergence of OMP
Theorem
Suppose Φ is a complete dictionary for ℝⁿ. For any vector x, the
residual after t steps of OMP satisfies

        ‖x − Φc‖₂ ≤ C/√t.
                                                        [DeVore-Temlyakov]

Even if x can be expressed sparsely, OMP may take n steps
before the residual is zero.
But, sometimes OMP correctly identifies sparse
representations.

Exact Recovery Condition and coherence


Theorem (ERC)
A sufficient condition for OMP to identify Λ after k steps is that

        max_{ℓ∉Λ} ‖Φ_Λ⁺ φ_ℓ‖₁ < 1

where A⁺ = (AᵀA)⁻¹Aᵀ.                                          [Tropp04]

Theorem
The ERC holds whenever k < ½(μ⁻¹ + 1). Therefore, OMP can
recover any sufficiently sparse signal. [Tropp04]

For most redundant dictionaries, k < ½(√n + 1).
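The ERC quantity is easy to evaluate numerically for a given dictionary and candidate support; a small sketch (mine), using the pseudoinverse exactly as in the theorem:

```python
import numpy as np

def erc_value(Phi, support):
    """Return max_{l not in support} ||pinv(Phi_support) @ phi_l||_1.

    The Exact Recovery Condition holds when this value is < 1.
    """
    d = Phi.shape[1]
    pinv = np.linalg.pinv(Phi[:, support])            # (A^T A)^{-1} A^T
    others = [l for l in range(d) if l not in set(support)]
    return max(np.linalg.norm(pinv @ Phi[:, l], 1) for l in others)
```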

Sparse representation with OMP


Suppose x has a k-sparse representation

        x = Σ_{ℓ∈Λ} b_ℓ φ_ℓ   where |Λ| = k

Sufficient to find Λ. When can OMP do so?


Define

        Φ_Λ = [φ_{ℓ₁} φ_{ℓ₂} . . . φ_{ℓ_k}]      (atoms with ℓ_s ∈ Λ)

and

        Ψ_Λ = [φ_{ℓ₁} φ_{ℓ₂} . . . φ_{ℓ_{N−k}}]  (atoms with ℓ_s ∉ Λ)

Define the greedy selection ratio

        ρ(r) = ‖Ψ_Λ* r‖_∞ / ‖Φ_Λ* r‖_∞
             = max_{ℓ∉Λ} |⟨r, φ_ℓ⟩| / max_{ℓ∈Λ} |⟨r, φ_ℓ⟩|
             = (max i.p. with bad atoms) / (max i.p. with good atoms)

OMP chooses a good atom iff ρ(r) < 1

Sparse
Theorem
Assume k ≤ (1/3)μ⁻¹. For any vector x, the approximation x̂ after k
steps of OMP satisfies

        ‖x − x̂‖₂ ≤ √(1 + 6k) ‖x − x_k‖₂

where x_k is the best k-term approximation to x.

[Tropp04]

Theorem

Assume 4k ≤ μ⁻¹. Two-phase greedy pursuit produces x̂ s.t.

        ‖x − x̂‖₂ ≤ 3 ‖x − x_k‖₂.

Assume k ≤ μ⁻¹. Two-phase greedy pursuit produces x̂ s.t.

        ‖x − x̂‖₂ ≤ √(1 + 2k²μ/(1 − 2kμ)²) ‖x − x_k‖₂.

[Gilbert, Strauss, Muthukrishnan, Tropp03]

Alternative algorithmic approach


Exact: non-convex optimization

        min ‖c‖₀   s.t.   x = Φc

Convex relaxation of the non-convex problem

        min ‖c‖₁   s.t.   x = Φc

Error: non-convex optimization

        argmin ‖c‖₀   s.t.   ‖x − Φc‖₂ ≤ ε

Convex relaxation of the non-convex problem

        argmin ‖c‖₁   s.t.   ‖x − Φc‖₂ ≤ ε.
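Splitting c = c⁺ − c⁻ with c⁺, c⁻ ≥ 0 turns the equality-constrained relaxation into the linear program mentioned on the next slide. A minimal sketch (my code, assuming scipy is available and Φ is real):

```python
import numpy as np
from scipy.optimize import linprog

def basis_pursuit(Phi, x):
    """Solve min ||c||_1 s.t. Phi c = x via the split c = c_plus - c_minus."""
    d = Phi.shape[1]
    cost = np.ones(2 * d)              # sum of c_plus and c_minus entries = ||c||_1
    A_eq = np.hstack([Phi, -Phi])      # Phi c_plus - Phi c_minus = x
    res = linprog(cost, A_eq=A_eq, b_eq=x, bounds=(0, None))
    return res.x[:d] - res.x[d:]
```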

Convex relaxation: algorithmic formulation

Well-studied algorithmic formulation
[Donoho, Donoho-Elad-Temlyakov, Tropp, and many others]

Optimization problem = linear program: linear objective
function (with variables c⁺, c⁻) and linear or quadratic
constraints
Still need an algorithm for solving the optimization problem
Hard part of analysis: showing the solution to the convex problem =
solution to the original problem

Exact Recovery Condition


Theorem (ERC)
A sufficient condition for BP to recover the sparsest representation
of x is that

        max_{ℓ∉Λ} ‖Φ_Λ⁺ φ_ℓ‖₁ < 1

where A⁺ = (AᵀA)⁻¹Aᵀ.                                          [Tropp04]

Theorem
The ERC holds whenever k < ½(μ⁻¹ + 1). Therefore, BP can
recover any sufficiently sparse signal. [Tropp04]

Alternate optimization formulations

Constrained minimization:

        argmin ‖c‖₁   s.t.   ‖x − Φc‖₂ ≤ ε.

Unconstrained minimization:

        minimize L(c; λ, x) = ½‖x − Φc‖₂² + λ‖c‖₁.

Many algorithms for ℓ₁-regularization
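One of the simplest such algorithms is iterative soft thresholding (ISTA); the sketch below (mine, not from the slides, assuming a real dictionary Φ) minimizes the unconstrained objective L(c; λ, x) by alternating a gradient step on the quadratic term with a soft-threshold step.

```python
import numpy as np

def ista(Phi, x, lam, n_iter=500):
    """Iterative soft thresholding for (1/2)||x - Phi c||_2^2 + lam ||c||_1."""
    step = 1.0 / np.linalg.norm(Phi, 2) ** 2       # 1 / Lipschitz constant of the gradient
    c = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        grad = Phi.T @ (Phi @ c - x)               # gradient of the quadratic term
        z = c - step * grad
        c = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)   # soft threshold
    return c
```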

Sparse approximation: Optimization vs. Greedy

Exact and Error amenable to convex relaxation and
convex optimization

Sparse not amenable to convex relaxation

        argmin ‖Φc − x‖₂   s.t.   ‖c‖₀ ≤ k

but appropriate for greedy algorithms

Connection between...

Sparse Approximation and Statistical Learning

Sparsity in statistical learning

        y = Xβ        y: responses,   X_j = (X_{1j}, . . . , X_{Nj})ᵀ: predictor variables

Goal: Given X and y, find α and coeffs. β ∈ ℝᵖ for the linear
model that minimizes the error

        (α̂, β̂) = argmin ‖(y − α) − Xβ‖₂².

Solution:
Least squares: low bias but large variance and hard to interpret
(lots of non-zero coefficients)
Shrink the β_j's, make β sparse.

Algorithms in statistical learning


Brute force: Calculate Mallows' Cp for every subset of
predictor variables, and choose the best one.
Greedy algorithms: Forward selection, forward stagewise,
least angle regression (LARS), backward elimination.
Constrained optimization: Quadratic programming problem
with linear constraints (e.g., LASSO); see the sketch below.
Unconstrained optimization: regularization techniques
Sparse approximation and SVM equivalence

[Girosi 96]
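As a concrete illustration of the LASSO item above (my example; it assumes scikit-learn is installed), the ℓ₁ penalty shrinks most coefficients exactly to zero:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
N, p = 100, 50
X = rng.standard_normal((N, p))
beta_true = np.zeros(p)
beta_true[:5] = [3.0, -2.0, 1.5, 1.0, -0.5]      # only 5 active predictors
y = X @ beta_true + 0.1 * rng.standard_normal(N)

model = Lasso(alpha=0.1).fit(X, y)               # alpha plays the role of lambda
print(np.count_nonzero(model.coef_))             # few nonzero coefficients remain
```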

Connection between...

Sparse Approximation and Compressed Sensing


        y = Φx        x: data/image,   y: measurements/coeffs.

Interchange the roles of signal and coefficients.

Problem statement (TCS Perspective)

Assume x has low complexity: x is k-sparse (with noise)

Construct
  a matrix Φ: ℝⁿ → ℝᵐ, with m as small as possible
  a decoding algorithm D
Given Φx for any signal x ∈ ℝⁿ, we can, with high probability,
quickly recover x̂ with

        ‖x − x̂‖_p ≤ (1 + ε) min_{y k-sparse} ‖x − y‖_q = (1 + ε) ‖x − x_k‖_q
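A toy end-to-end sketch of this setup (my code; a Gaussian measurement matrix is one common choice, and the omp function from the earlier sketch serves as the decoder):

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, k = 256, 80, 5
x = np.zeros(n)
x[rng.choice(n, k, replace=False)] = rng.standard_normal(k)   # k-sparse signal

Phi = rng.standard_normal((m, n)) / np.sqrt(m)   # random measurement matrix, m << n
y = Phi @ x                                      # m linear measurements

x_hat = omp(Phi, y, k)                           # decode with the earlier OMP sketch
print(np.linalg.norm(x - x_hat))                 # typically near zero for these sizes
```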

Comparison with Sparse Approximation

Sparse: Given y and Φ, find (sparse) x such that y = Φx.
Return x̂ with the guarantee that

        ‖Φx̂ − y‖₂

is small compared with ‖y − Φx_k‖₂.

CS: Given y and Φ, find (sparse) x such that y = Φx. Return
x̂ with the guarantee that

        ‖x̂ − x‖_p

is small compared with ‖x − x_k‖_q.

p and q not always the same, not always = 2.

Comparison with Statistical Learning

        y = Xβ        y ∈ ℝᴺ: responses,   X_j = (X_{1j}, . . . , X_{Nj}): predictor variables

Goal: Given X and y, find α and coeffs. β ∈ ℝᵖ for the linear
model that minimizes the error

        (α̂, β̂) = argmin ‖(y − α) − Xβ‖₂².

Statistics: X drawn iid from a distribution (i.e., cheap
generation); characterize mathematical performance as a
function of the distribution
TCS: X (or the distribution) is constructed (i.e., expensive);
characterize algorithmic performance as a function of space,
time, and randomness

Analogy: root-finding

(Figure: a function f with root p; one approximation p̂ comes with the guarantee
|p̂ − p| small, the other with the guarantee |f(p̂) − 0| small)

Sparse: Given f (and y = 0), find p such that f (p) = 0.
Return p̂ with the guarantee that

        |f (p̂) − 0| is small.

CS: Given f (and y = 0), find p such that f (p) = 0. Return
p̂ with the guarantee that

        |p̂ − p| is small.

Parameters

1. Number of measurements m
2. Recovery time
3. Approximation guarantee (norms, mixed)
4. One matrix vs. distribution over matrices
5. Explicit construction
6. Universal matrix (for any basis, after measuring)
7. Tolerance to measurement noise

Applications

* Data stream algorithms
    x_i = number of items with index i
    can maintain Ax under increments to x and recover an approximation to x

* Efficient data sensing
    digital/analog cameras
    analog-to-digital converters
    high throughput biological screening (pooling designs)

* Error-correcting codes
    code = {y ∈ ℝⁿ | Ay = 0}
    x = error vector, Ax = syndrome

Two approaches
Geometric

[Donoho 04], [Candes-Tao 04, 06], [Candes-Romberg-Tao 05],
[Rudelson-Vershynin 06], [Cohen-Dahmen-DeVore 06], and many others...

Dense recovery matrices that satisfy RIP (e.g., Gaussian,
Fourier)
Geometric recovery methods (ℓ₁ minimization, LP)

        x̂ = argmin ‖z‖₁   s.t.   Φz = Φx

Uniform guarantee: one matrix A that works for all x

Combinatorial

[Gilbert-Guha-Indyk-Kotidis-Muthukrishnan-Strauss 02],
[Charikar-Chen-FarachColton 02], [Cormode-Muthukrishnan 04],
[Gilbert-Strauss-Tropp-Vershynin 06, 07]

Sparse random matrices (typically)

Combinatorial recovery methods or weak, greedy algorithms
Per-instance guarantees, later uniform guarantees

Summary

* Sparse approximation, statistical learning, and compressive
sensing are intimately related
* Many models of computation and scientific/technological
problems in which they all arise
* Algorithms for all are similar: optimization and greedy
* Community progress on geometric and statistical models for
matrices Φ and signals x, and different problem instance types
* Explicit constructions?
* Better/different geometric/statistical models?
* Better connections with coding and complexity theory?
