…different statistical models are established in Section 6 as corollaries of our main result. In Section 7, we present further discussions of our results, and list some directions for future research. The proofs of our results can be found in the supplementary material.

2 Problem Setup

We consider a general statistical modeling setting where we are given n independent samples {y_i}_{i=1}^n drawn from some distribution P with a sparse parameter β* := β(P) ∈ R^p that has at most s non-zero entries. We are interested in estimating this sparse parameter β* given the n samples via an ℓ_1-regularized M-estimator of the form

    β̂_n := arg min_{β ∈ R^p} L_n(β) + τ_n ‖β‖_1,    (1)

where L_n is some convex function, and τ_n > 0 is a regularization parameter.

We mention here a special case of this model that has broad applications in machine learning. For fixed vectors x_1, . . . , x_n in R^p, suppose that we are given realizations y_1, . . . , y_n of independent random variables Y_1, . . . , Y_n in R. We assume that each Y_i follows a probability distribution P_{θ_i} parametrized only by θ_i, where θ_i := ⟨x_i, β*⟩ for some sparse parameter β* ∈ R^p. Then it is natural to consider the ℓ_1-regularized maximum-likelihood estimator

    β̂_n := arg min_{β ∈ R^p} (1/n) Σ_{i=1}^n ℓ(y_i; β, x_i) + τ_n ‖β‖_1,

where ℓ denotes the negative log-likelihood at y_i given x_i and β. Thus, we obtain (1) with L_n(β) := (1/n) Σ_{i=1}^n ℓ(y_i; β, x_i).
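For illustration only (this sketch is not part of the original text), the estimator in (1) can be computed with a simple proximal-gradient scheme; the squared loss, the step size, and the iteration count below are our own assumptions for the example.

    import numpy as np

    def soft_threshold(z, t):
        # Proximal operator of t*‖·‖_1: elementwise soft-thresholding.
        return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

    def l1_m_estimator(grad_Ln, p, tau_n, step=0.1, iters=5000):
        # Proximal-gradient (ISTA) sketch of the estimator in (1).
        # grad_Ln: gradient of the convex loss L_n; step must stay below
        # the reciprocal of the Lipschitz constant of grad_Ln.
        beta = np.zeros(p)
        for _ in range(iters):
            beta = soft_threshold(beta - step * grad_Ln(beta), step * tau_n)
        return beta

    # Example with the squared loss L_n(β) = (1/2n)‖y − Xβ‖_2².
    rng = np.random.default_rng(0)
    n, p, s = 200, 50, 5
    X = rng.standard_normal((n, p))
    beta_star = np.zeros(p); beta_star[:s] = 1.0
    y = X @ beta_star + 0.1 * rng.standard_normal(n)
    grad = lambda b: X.T @ (X @ b - y) / n
    beta_hat = l1_m_estimator(grad, p, tau_n=np.sqrt(np.log(p) / n))
    print(np.flatnonzero(np.abs(beta_hat) > 1e-6))  # estimated support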
There are of course many other examples; to name one other, we mention the graphical learning problem, where we want to learn a sparse concentration matrix of a vector-valued random variable. In this setting, we also arrive at the formulation (1), where L_n is the negative log-likelihood of the data [Ravikumar et al., 2011].

We focus on the sparsistency of β̂_n; roughly speaking, an estimator β̂_n is sparsistent if it recovers the support of β* with high probability when the number of samples n is large enough.

Definition 2.1 (Sparsistency). A sequence of estimators {β̂_n}_{n=1}^∞ is called sparsistent if

    lim_{n→∞} P { supp β̂_n ≠ supp β* } = 0.

The main result of this paper is that, if the function L_n is convex and satisfies the LSSC, and certain assumptions analogous to those used for linear models (see [Wainwright, 2009]) hold true, then the ℓ_1-regularized M-estimator β̂_n in (1) is sparsistent under suitable conditions on the regularization parameter τ_n and the triplet (n, p, s). We allow for the case of diverging dimensions [Fan and Lv, 2011, Ravikumar et al., 2010, Ravikumar et al., 2011, Wainwright, 2009, Zhao and Yu, 2006], where p grows exponentially with n.

Notations and Basic Definitions

Fix v ∈ R^p, and let P = {1, . . . , p}. For any S ⊆ P, the notation v_S denotes the sub-vector of v on S, and the notation v_{S^c} denotes the sub-vector v_{P∖S}. For i ∈ P, the notation v_i denotes v_{{i}}. We denote the support set of v by supp v, defined as supp v = {i : v_i ≠ 0, i ∈ P}. The notation sign v denotes the vector (sign v_1, . . . , sign v_p), where sign v_i = v_i |v_i|^{−1} if v_i ≠ 0, and sign v_i = 0 otherwise, for all i ∈ P. We denote the transpose of v by v^T, and the ℓ_q-norm of v by ‖v‖_q for q ∈ [1, +∞]. For u, v ∈ R^p, the notation ⟨u, v⟩ denotes Σ_{i=1}^p u_i v_i.

For A ∈ R^{p×p}, the notations A_{S,S}, A_{S^c,S}, supp A, sign A, and A^T are defined analogously to the vector case. The notation ‖A‖_q denotes the operator norm induced by the vector ℓ_q-norm; in particular, ‖A‖_2 denotes the spectral norm of A.

Let X be a real-valued random variable. We denote the expectation and variance of X by E X and var X, respectively. The probability of an event E is denoted by P E.

Let f be a vector-valued function with domain dom f ⊆ R^p. The notations ∇f and ∇²f denote the gradient and Hessian mapping of f, respectively. The notation f ∈ C^k(dom f) means that f is k-times continuously differentiable on dom f. For a given function f ∈ C^k(dom f), its k-th order Fréchet derivative at x ∈ dom f is denoted by D^k f(x), which is a multilinear symmetric form [Zeidler, 1995]. The following special cases summarize how to compute all of the quantities related to the Fréchet derivative in this paper:

1. The first order Fréchet derivative is simply the gradient mapping; therefore, Df(x)[u] = ⟨∇f(x), u⟩ for all u ∈ R^p.

2. The second order Fréchet derivative is the Hessian mapping; therefore, D²f(x)[u, v] = ⟨u, ∇²f(x)v⟩ for all u, v ∈ R^p.
3. The third order Fréchet derivative is defined as follows. We first define the 2-linear form (matrix)

    D³f(x)[u] := lim_{t→0} [∇²f(x + tu) − ∇²f(x)] / t.

Then

    D³f(x)[u, v, w] = (D³f(x)[u])[v, w] = ⟨v, (D³f(x)[u]) w⟩.

We then define the 1-linear form (vector) D³f(x)[u, v] to be the unique vector such that ⟨D³f(x)[u, v], w⟩ = D³f(x)[u, v, w] for all vectors w in R^p.

4. When the arguments are the same, we simply have

    D^k f(x)[u, . . . , u] = (d^k/dt^k) φ_u(t) |_{t=0},    where φ_u(t) := f(x + tu).

We conclude this section by briefly discussing the connection of the LSSC with other conditions. The following result, Proposition 9.1.1 of [Nesterov and Nemirovskii, 1994], will be useful here and throughout the paper.

Proposition 3.3. Let A be a 3-linear symmetric form on (R^p)³, and B be a positive-semidefinite 2-linear symmetric form on (R^p)². If

    |A[u, u, u]| ≤ B[u, u]^{3/2}

for all u ∈ R^p, then

    |A[u, v, w]| ≤ B[u, u]^{1/2} B[v, v]^{1/2} B[w, w]^{1/2}.
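As a quick sanity check of item 4 above (our illustration, not from the paper), one can compare finite-difference derivatives of φ_u with closed-form directional derivatives; the choice f(x) = ln(1 + exp(⟨a, x⟩)) below is an assumption made for the example.

    import numpy as np

    # Check item 4: D^k f(x)[u,...,u] equals the k-th derivative of
    # φ_u(t) = f(x + t u) at t = 0. For f(x) = ln(1 + exp(⟨a, x⟩)) and
    # σ(z) = 1/(1 + e^{−z}), the closed forms are
    #   D²f(x)[u,u] = σ(z)(1−σ(z))⟨a,u⟩²,
    #   D³f(x)[u,u,u] = σ(z)(1−σ(z))(1−2σ(z))⟨a,u⟩³,  with z = ⟨a,x⟩.
    a = np.array([0.3, -1.2, 0.7])
    f = lambda x: np.log1p(np.exp(a @ x))

    def phi_derivative(f, x, u, order, h=1e-3):
        # Central finite differences for the k-th derivative of φ_u at t = 0.
        phi = lambda t: f(x + t * u)
        if order == 2:
            return (phi(h) - 2 * phi(0.0) + phi(-h)) / h**2
        elif order == 3:
            return (phi(2*h) - 2*phi(h) + 2*phi(-h) - phi(-2*h)) / (2 * h**3)
        raise ValueError("only orders 2 and 3 are implemented")

    x = np.array([0.1, 0.2, -0.3])
    u = np.array([1.0, -0.5, 0.2])
    z, au = a @ x, a @ u
    sig = 1.0 / (1.0 + np.exp(-z))
    print(phi_derivative(f, x, u, 2), sig * (1 - sig) * au**2)
    print(phi_derivative(f, x, u, 3), sig * (1 - sig) * (1 - 2 * sig) * au**3)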
By a direct differentiation, we obtain for all u ∈ R^p that

    |D³f(β* + δ)[u, u, u]| = 2 |1 + γ|^{−3} (D²f(β*)[u, u])^{3/2},    (5)

where

    γ := ⟨x, δ⟩ / ⟨x, β*⟩.

Combining this with Proposition 3.3, we have for each standard basis vector e_j that

    |D³f(β* + δ)[u, u, e_j]| ≤ 2 |1 + γ|^{−3} D²f(β*)[u, u] (D²f(β*)[e_j, e_j])^{1/2}
                             ≤ 2 (1 + κ^{−1})³ λ_max d_max^{1/2} ‖u‖_2²,

where λ_max is the maximum restricted eigenvalue of D²f(β*), defined as

    λ_max := sup { D²f(β*)[u, u] : ‖u‖_2 ≤ 1, u_{S^c} = 0 },

and d_max denotes the maximum diagonal entry of ∇²f(β*). Therefore, f satisfies the (β*, N_{β*})-LSSC with parameter K := 2 (1 + κ^{−1})³ λ_max d_max^{1/2}, where

    N_{β*} := { β* + δ : ‖δ‖_2 ≤ ⟨x, β*⟩ / ((1 + κ) ‖x_S‖_2), δ ∈ R^p }.

Example 4.3. Consider the function f(Θ) = Tr(XΘ) − ln det Θ with a fixed X ∈ R^{p×p}, and with dom f := {Θ ∈ R^{p×p} : Θ ≻ 0}. We show that, for any fixed Θ* ∈ dom f, there exists some non-negative K and some open set N_{Θ*} such that f satisfies the (Θ*, N_{Θ*})-LSSC with parameter K. This function appears as the negative log-likelihood in the Gaussian graphical learning problem.

Note that the previous definitions (in particular, Definition 3.1) should be interpreted here as being taken with respect to the vectorizations of the relevant matrices.

It is already known that f is standard self-concordant [Nesterov, 2004]; that is,

    |D³f(Θ* + Δ)[U, U, U]| ≤ 2 (D²f(Θ* + Δ)[U, U])^{3/2},

for all U ∈ R^{p×p} and all Δ ∈ R^{p×p} such that Θ* + Δ ∈ dom f. This implies, by Proposition 3.3,

    |D³f(Θ* + Δ)[U, U, V]| ≤ 2 D²f(Θ* + Δ)[U, U] (D²f(Θ* + Δ)[V, V])^{1/2},

for all U, V ∈ R^{p×p}, and all Δ ∈ R^{p×p} such that Θ* + Δ ∈ dom f. Moreover, by a direct differentiation and combining the preceding observations, it follows that f satisfies the (Θ*, N_{Θ*})-LSSC with parameter K := 2 κ^{−3} (1 + κ)³ ρ_min^{−3}, where

    N_{Θ*} = { Θ* + Δ : ‖Δ‖_F < ρ_min / (1 + κ), Δ = Δ^T, Δ ∈ R^{p×p} }.

Here we have not exploited the special structure of U in Definition 3.1 (namely, u_{S^c} = 0), though conceivably the constant K could improve by doing so. Note that N_{Θ*} ⊂ dom f and N_{Θ*} is convex.
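The self-concordance inequality quoted above is easy to probe numerically (this check is ours, not the paper's); the closed forms D²f(Θ)[U, U] = Tr((Θ^{−1}U)²) and D³f(Θ)[U, U, U] = −2 Tr((Θ^{−1}U)³) for the log-det term are standard.

    import numpy as np

    # Numerical probe of |D³f(Θ)[U,U,U]| ≤ 2 (D²f(Θ)[U,U])^{3/2} for
    # f(Θ) = Tr(XΘ) − ln det Θ. The linear term Tr(XΘ) has no second or
    # third derivative, so only the log-det part matters.
    rng = np.random.default_rng(1)
    p = 4
    A = rng.standard_normal((p, p))
    Theta = A @ A.T + p * np.eye(p)          # a positive-definite point in dom f
    U = rng.standard_normal((p, p))
    U = (U + U.T) / 2                        # symmetric direction

    M = np.linalg.solve(Theta, U)            # Θ^{-1} U
    d2 = np.trace(M @ M)                     # D²f(Θ)[U,U]
    d3 = -2.0 * np.trace(M @ M @ M)          # D³f(Θ)[U,U,U]
    print(abs(d3) <= 2.0 * d2**1.5 + 1e-12)  # prints True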
5 Deterministic Sufficient Conditions

We are now in a position to state the main result of this paper, whose proof can be found in the supplementary material.

Let β* ∈ R^p be the true parameter, and let S = {i : (β*)_i ≠ 0} be its support set. Define the "genie-aided" estimator with exact support information:

    β̌_n ∈ arg min_{β ∈ R^p : β_{S^c} = 0} L_n(β) + τ_n ‖β‖_1.    (6)

Theorem 5.1. Suppose that β̌_n is uniquely defined. Then the ℓ_1-regularized estimator β̂_n defined in (1)
uniquely exists, successfully recovers the sign pattern, i.e., sign β̂_n = sign β*, and satisfies the error bound

    ‖β̂_n − β*‖_2 ≤ r_n := ((α + 4)/λ_min) √s τ_n,    (7)

if the following conditions hold true.

1. (Local structured smoothness condition) L_n is convex, three times continuously differentiable, and satisfies the (β*, N_{β*})-LSSC with parameter K ≥ 0, for some convex N_{β*} ⊆ dom L_n.

2. (Positive definite restricted Hessian) The restricted Hessian at β* satisfies [∇²L_n(β*)]_{S,S} ≥ λ_min I for some λ_min > 0.

3. (Irrepresentability condition) For some α ∈ (0, 1], it holds that

    ‖ [∇²L_n(β*)]_{S^c,S} ([∇²L_n(β*)]_{S,S})^{−1} ‖_∞ ≤ 1 − α.    (8)

4. (Beta-min condition) The smallest non-zero entry of β* satisfies

    β_min := min {|(β*)_k| : k ∈ S} > r_n,    (9)

where r_n is defined in (7).

5. The regularization parameter τ_n satisfies

    τ_n < λ_min² α / (4 (α + 4)² K s).    (10)

6. The gradient of L_n at β* satisfies

    ‖∇L_n(β*)‖_∞ ≤ (α/4) τ_n.    (11)

7. The relation B_{r_n} ⊆ N_{β*} holds, where

    B_{r_n} := {β ∈ R^p : ‖β − β*‖_2 ≤ r_n, β_{S^c} = 0}

and r_n is defined in (7).

As mentioned previously, the first condition is the key assumption permitting us to perform a general analysis. The second, third, and fourth assumptions are analogous to those appearing in the literature for sparse linear regression. We refer to [Bühlmann and van de Geer, 2011] for a systematic discussion of these conditions.¹

¹ Equation (8) is sometimes called the incoherence condition [Wainwright, 2009].

The remaining conditions determine the interplay between τ_n, n, p, and s. Whether the relation B_{r_n} ⊆ N_{β*} holds depends on the specific N_{β*} that one can derive for the given loss function L_n. Whether the upper bound on ‖∇L_n(β*)‖_∞ holds depends on the concentration of measure behavior of ∇L_n(β*), which usually concentrates around 0. In the next section, we will give concrete examples for the high-dimensional setting, where p and s scale with n.

Of course, sign β̂_n = sign β* implies that supp β̂_n = supp β*, i.e., successful support recovery.
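Conditions 2 and 3 are directly computable once the Hessian and the support are given; the helper below is our own sketch (not code from the paper) and evaluates the restricted minimum eigenvalue and the largest α admissible in (8).

    import numpy as np

    def check_theorem_conditions(H, S):
        # H: the Hessian ∇²L_n(β*) as a (p, p) array; S: support indices.
        # Returns (λ_min of H_{S,S}, largest α allowed by condition (8)).
        p = H.shape[0]
        Sc = np.setdiff1d(np.arange(p), S)
        H_SS = H[np.ix_(S, S)]
        lam_min = np.linalg.eigvalsh(H_SS).min()       # condition 2
        Q = H[np.ix_(Sc, S)] @ np.linalg.inv(H_SS)
        # ‖·‖_∞ operator norm = maximum absolute row sum, as used in (8).
        alpha = 1.0 - np.abs(Q).sum(axis=1).max()
        return lam_min, alpha

    # Example: H = XᵀX/n for a Gaussian design; condition 3 holds iff alpha > 0.
    rng = np.random.default_rng(2)
    n, p = 500, 20
    X = rng.standard_normal((n, p))
    lam_min, alpha = check_theorem_conditions(X.T @ X / n, S=np.arange(5))
    print(lam_min > 0, alpha)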
6 Applications

In this section, we provide several applications of Theorem 5.1, presenting concrete bounds on the sample complexity in each case. We defer the full proofs of the results in this section to the supplementary material. However, in each case, we present here the most important step of the proof, namely, verifying the LSSC.

Note that instead of the classical setting where only the sample size n increases, we consider the high-dimensional setting, where the ambient dimension p and the sparsity level s are allowed to grow with n [Fan and Lv, 2011, Fan and Peng, 2004, Ravikumar et al., 2010, Ravikumar et al., 2011, Wainwright, 2009, Zhao and Yu, 2006].

6.1 Linear Regression

We first consider the linear regression model with additive sub-Gaussian noise. This setting trivially fits into our theoretical framework.

Definition 6.1 (Sub-Gaussian Random Variables). A zero-mean real-valued random variable Z is sub-Gaussian with parameter c > 0 if

    E exp(tZ) ≤ exp(c² t² / 2)

for all t ∈ R.

Let X_n := {x_1, . . . , x_n} ⊂ R^p be given. Define the matrix X_n ∈ R^{n×p} such that the i-th row of X_n is x_i. We assume that the elements in X_n are normalized such that each column of X_n has ℓ_2-norm less than or equal to √n. Let W_1, . . . , W_n be independent sub-Gaussian random variables with parameter c, and define Y_i := ⟨x_i, β*⟩ + W_i.

We consider the ℓ_1-regularized M-estimator of the form (1), with

    L_n(β) = (1/n) Σ_{i=1}^n (1/2) (Y_i − ⟨x_i, β⟩)².

As shown in the first example of Section 4, L_n satisfies the LSSC with parameter K = 0 everywhere in R^p.
Therefore, the condition on τ_n in (10) is trivially satisfied, as is the final condition listed in the theorem.

By a direct calculation, we have

    ∇L_n(β*) = (1/n) Σ_{i=1}^n (Y_i − E Y_i) x_i.

By the union bound and the standard concentration inequality for sub-Gaussian random variables [Boucheron et al., 2013],

    P { ‖∇L_n(β*)‖_∞ ≥ α τ_n / 4 }
        ≤ Σ_{i=1}^p P { |[∇L_n(β*)]_i| ≥ α τ_n / 4 }
        ≤ 2p exp(−c n t²) |_{t = α τ_n / 4}.

Since [D²L_n(β)]_{S,S} = [D²L_n(β*)]_{S,S} is positive definite for all β ∈ R^p by the second assumption of Theorem 5.1, β̌_n uniquely exists, and Theorem 5.1 is applicable. By choosing τ_n sufficiently large that the above bound decays to zero, we obtain the following.

Corollary 6.1. For the linear regression problem described above, suppose that assumptions 2 to 4 of Theorem 5.1 hold for some λ_min and α bounded away from zero.² If s log p ≪ n, and we choose τ_n ≍ (n^{−1} log p)^{1/2}, then the ℓ_1-regularized maximum likelihood estimator is sparsistent.

² For all of the examples in this section, these assumptions are independent of the data, and we can thus talk about them being satisfied deterministically.

Observe that this recovers the scaling law given in [Wainwright, 2009] for the linear regression model.
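To see numerically why the choice τ_n ≍ (n^{−1} log p)^{1/2} drives the bound above to zero, the following throwaway computation (ours; the constants c and α are arbitrary assumptions) evaluates 2p exp(−c n t²) at t = α τ_n / 4 along a high-dimensional scaling sequence.

    import numpy as np

    # With τ_n = (8/α) sqrt(log(p) / (c n)) we get c n t² = 4 log p, so the
    # union bound 2 p exp(−c n t²) equals 2 p^{−3} and vanishes as p grows.
    c, alpha = 0.5, 0.5
    for n, p in [(10**3, 10**2), (10**4, 10**3), (10**5, 10**4)]:
        tau_n = (8.0 / alpha) * np.sqrt(np.log(p) / (c * n))
        t = alpha * tau_n / 4.0
        print(n, p, 2 * p * np.exp(-c * n * t * t))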
6.2 Logistic Regression

Let X_n := {x_1, . . . , x_n} ⊂ R^p be given. As in Section 6.1, we assume that Σ_{i=1}^n (x_i)_j² ≤ n for all j ∈ {1, . . . , p}.

Let β* ∈ R^p be sparse, and define S := supp β*. We are interested in estimating β* given X_n and Y_n := {y_1, . . . , y_n}, where each y_i is the realization of a Bernoulli random variable Y_i with

    P {Y_i = 1} = 1 − P {Y_i = 0} = 1 / (1 + exp(−⟨x_i, β*⟩)).

The random variables Y_1, . . . , Y_n are assumed to be independent.

We consider the ℓ_1-regularized maximum-likelihood estimator of the form (1) with

    L_n(β) := (1/n) Σ_{i=1}^n ln {1 + exp[−(2Y_i − 1) ⟨x_i, β⟩]}.

Define ℓ_i(β) = ln [1 + exp(−(2y_i − 1) ⟨x_i, β⟩)]. The cases y_i = 0 and y_i = 1 are handled similarly, so we focus on the latter. A direct differentiation yields the following (this is most easily verified for u = v):

    |D³ℓ_i(β* + δ)[u, u, v]| = |⟨x_i, v⟩| · (|1 − exp(−⟨x_i, β* + δ⟩)| / (1 + exp(−⟨x_i, β* + δ⟩))) · D²ℓ_i(β* + δ)[u, u]
                             ≤ |⟨x_i, v⟩| D²ℓ_i(β* + δ)[u, u],

and

    D²ℓ_i(β)[u, u] = exp(−⟨x_i, β⟩) ⟨x_i, u⟩² / [1 + exp(−⟨x_i, β⟩)]² ≤ (1/4) ⟨x_i, u⟩²

for all β ∈ R^p. The last inequality follows since the function z/(1 + z)² has a maximum value of 1/4 for z ≥ 0. It follows that

    |D³ℓ_i(β* + δ)[u, u, v]| ≤ (1/4) |⟨x_i, v⟩| |⟨x_i, u⟩|² ≤ (1/4) ‖(x_i)_S‖_2² ‖x_i‖_∞ ‖u‖_2²

for any u ∈ R^p such that u_{S^c} = 0, and for any v equal to some standard basis vector e_j. Hence, L_n satisfies the (β*, N_{β*})-LSSC with parameter K = (1/4) ν_n² γ_n, where

    ν_n := max_i ‖(x_i)_S‖_2,
    γ_n := max_i ‖x_i‖_∞,

and where N_{β*} can be any fixed open convex neighborhood of β* in R^p.

Corollary 6.2. For the logistic regression problem described above, suppose that assumptions 2 to 4 of Theorem 5.1 hold for some λ_min and α bounded away from zero. If we choose τ_n ≍ (n^{−1} log p)^{1/2}, and s and p such that s² (log p) ν_n⁴ γ_n² ≪ n, then the ℓ_1-regularized maximum-likelihood estimator is sparsistent.

In [Bunea, 2008], a scaling law of the form s ≪ √n / (log n)² is given, but the result is restricted to the case that p grows polynomially with n. The result in [Bach, 2010] yields the scaling s² (log p) ν̄_n² ≪ n, where ν̄_n := max_i {‖x_i‖_2}. It should be noted that ν̄_n is generally significantly larger than ν_n and γ_n; for example, for i.i.d. Gaussian vectors, these scale on average as O(√p), O(√s) and O(1), respectively. Our result recovers the same dependence of n on s and p as that in [Bach, 2010], but removes the dependence on ν̄_n. Of course, we do not restrict p to grow polynomially with n.
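The quantities ν_n, γ_n, and the LSSC parameter K = (1/4) ν_n² γ_n are simple functions of the design; the snippet below (our illustration, not from the paper) computes them together with the sample-size ratio appearing in Corollary 6.2.

    import numpy as np

    def logistic_lssc_constant(X, S):
        # K = (1/4) ν_n² γ_n with ν_n = max_i ‖(x_i)_S‖_2 and γ_n = max_i ‖x_i‖_∞,
        # following the bound derived above for the logistic loss.
        nu = np.linalg.norm(X[:, S], axis=1).max()
        gamma = np.abs(X).max()
        return 0.25 * nu**2 * gamma, nu, gamma

    rng = np.random.default_rng(3)
    n, p, s = 10000, 100, 4
    X = rng.standard_normal((n, p))
    K, nu, gamma = logistic_lssc_constant(X, S=np.arange(s))
    # Corollary 6.2 needs (up to constants) s² (log p) ν_n⁴ γ_n² ≪ n:
    print(K, s**2 * np.log(p) * nu**4 * gamma**2 / n)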
Corollary 6.4 is for graphical learning on general sparse networks, as we only put a constraint on s. Several previous works have instead imposed structural constraints on the maximum degree of each node; e.g., see [Ravikumar et al., 2011]. Since this model requires additional structural assumptions beyond sparsity alone, it is outside the scope of our theoretical framework.

Acknowledgements

This work was supported in part by the European Commission under Grant MIRG-268398, ERC Future Proof, SNF 200021-132548, SNF 200021-146750 and SNF CRSII2-147633.

References

[Boucheron et al., 2013] Boucheron, S., Lugosi, G., and Massart, P. (2013). Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford Univ. Press, Oxford.

[Bühlmann and van de Geer, 2011] Bühlmann, P. and van de Geer, S. (2011). Statistics for High-Dimensional Data. Springer, Berlin.

[Meinshausen and Bühlmann, 2006] Meinshausen, N. and Bühlmann, P. (2006). High-dimensional graphs and variable selection with the Lasso. Ann. Stat., 34:1436–1462.