
Submitted to the Annals of Statistics

arXiv: math.PR/0000000

A WELL CONDITIONED AND SPARSE ESTIMATE OF
COVARIANCE AND INVERSE COVARIANCE MATRIX
USING A JOINT PENALTY

By Ashwini Maurya
Michigan State University
We develop a method for estimating a well conditioned and sparse covariance matrix from a sample of vectors drawn from a sub-gaussian distribution in a high dimensional setting. The proposed estimator is obtained by minimizing the squared loss function plus a joint penalty consisting of the $\ell_1$ norm and the sum of squared deviations of the eigenvalues from a positive constant. The joint penalty plays two important roles: i) the $\ell_1$ penalty on each entry of the covariance matrix reduces the effective number of parameters and consequently the estimate is sparse, and ii) the sum of squared deviations penalty on the eigenvalues controls the over-dispersion in the eigenvalues of the sample covariance matrix. In contrast to some of the existing methods of covariance matrix estimation, where often the interest is to estimate a sparse matrix, the proposed method is flexible in estimating both a sparse and well-conditioned covariance matrix simultaneously. We also extend the method to inverse covariance matrix estimation and establish the consistency of the proposed estimators in both Frobenius and operator norm. The proposed algorithm for covariance and inverse covariance matrix estimation is very fast, efficient and easily scalable to large scale data analysis problems. The simulation studies for varying sample size and number of variables show that the proposed estimator performs better than graphical lasso and PDSCE estimates for various choices of structured covariance and inverse covariance matrices. We also use our proposed estimator for tumor tissue classification using gene expression data and compare its performance with some other classification methods.

AMS 2000 subject classifications: Primary 62G20, 62G05; secondary 62H12.
Key words and phrases: Sparsity, Eigenvalue Penalty, Matrix Estimation, Penalized Estimation.
Corresponding author: Maurya.

1. Introduction. With the recent surge in data technology and storage capacity, today's statisticians often encounter data sets where the sample size $n$ is small and the number of variables $p$ is very large: often hundreds, thousands or even a million or more. Examples include gene expression data and web search problems [Clarke et al. (2008), Pass et al. (2006)]. For many of these high dimensional data problems, the choice of classical statistical methods becomes inappropriate for making valid inference. The recent developments in asymptotic theory deal with increasing $p$ as long as both $p$ and $n$ tend to infinity at some rate depending upon the parameter of interest.
The estimation of the covariance and inverse covariance matrix is a problem of primary interest in multivariate statistical analysis. Some of the applications include: (i) Principal component analysis (PCA) [Johnstone et al. (2004), Zou et al. (2006)], where the goal is to project the data onto the best $k$-dimensional subspace, where best means that the projected data explain as much of the variation in the original data as possible without increasing $k$. (ii) Discriminant analysis [Mardia et al. (1979)], where the goal is to classify observations into different classes; an estimate of the covariance and inverse covariance matrix plays an important role since the classifier is often a function of these entities. (iii) Regression analysis: if interest focuses on estimation of regression coefficients with correlated (or longitudinal) data, a sandwich estimator of the covariance matrix may be used to provide standard errors for the estimated coefficients that are robust in the sense that they remain consistent under mis-specification of the covariance structure. (iv) Gaussian graphical modeling [Meinshausen and Bühlmann (2006), Wainwright et al. (2006), Yuan and Lin (2007)]: the relationship structure among nodes can be inferred from the inverse covariance matrix; a zero entry in the inverse covariance matrix implies conditional independence between the corresponding nodes.
The estimation of a large dimensional covariance matrix based on few sample observations is a difficult problem, especially when $n \asymp p$ (here $a_n \asymp b_n$ means that there exist positive constants $c$ and $C$ such that $c \le a_n/b_n \le C$). In these situations, the sample covariance matrix becomes unstable, which explodes the estimation error. It is well known that the eigenvalues of the sample covariance matrix are over-dispersed, which means that the eigen-spectrum of the sample covariance matrix is not a good estimator of its population counterpart [Marcenko and Pastur (1967), Karoui (2008)]. To illustrate this point, consider $\Sigma_p = I_p$, so that all the population eigenvalues are 1. A result from Geman (1980) shows that if the entries of the $X_i$'s are i.i.d. and have a finite fourth moment, and if $p/n \to \gamma > 0$, then the largest sample eigenvalue $l_1$ satisfies
$$l_1 \to (1 + \sqrt{\gamma})^2, \quad \text{a.s.}$$
This suggests that $l_1$ is not a consistent estimator of the largest eigenvalue $\lambda_1$ of the population covariance matrix. In particular, if $n = p$ then $l_1$ tends to 4 whereas $\lambda_1$ is 1. This is also evident in the eigenvalue plot in figure 2.1. The distribution of $l_1$ also depends upon the underlying structure of the true covariance matrix. From figure 2.1, it is evident that the smaller sample eigenvalues tend to underestimate the true eigenvalues for large $p$ and small $n$. For more discussion see Karoui (2008).
To correct this bias, a natural choice would be to shrink the sample eigenvalues towards some suitable constant to reduce the over-dispersion. For instance, Stein (1975) proposed an estimator of the form $\hat{\Sigma} = \hat{U}\Lambda(\hat{\lambda})\hat{U}^T$, where $\Lambda(\hat{\lambda})$ is a diagonal matrix whose diagonal entries are a transformed function of the sample eigenvalues and $\hat{U}$ is the matrix of sample eigenvectors. In another interesting paper, Ledoit and Wolf (2004) proposed an estimator that shrinks the sample covariance matrix towards the identity matrix. In another paper, Karoui (2008) proposed a non-parametric estimator of the spectrum of eigenvalues and showed that his estimator is consistent in the sense of weak convergence of distributions.
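The over-dispersion phenomenon is easy to reproduce numerically. The following minimal R sketch (an illustration added here, not part of the original simulation code) draws $n = p = 50$ standard normal vectors, for which all population eigenvalues equal 1, and shows that the sample eigen-spectrum spreads far beyond this value, with the largest eigenvalue near $(1+\sqrt{1})^2 = 4$ and the smallest near 0.

# Illustration only: over-dispersion of sample eigenvalues when Sigma = I_p.
set.seed(1)
n <- 50; p <- 50                 # n = p, so p/n = 1
X <- matrix(rnorm(n * p), n, p)  # rows are i.i.d. N(0, I_p)
S <- crossprod(X) / n            # sample covariance (mean known to be zero)
ev <- eigen(S, symmetric = TRUE, only.values = TRUE)$values
range(ev)                        # largest close to 4, smallest close to 0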
The covariance matrix estimates based on eigen-spectrum shrinkage are well conditioned in the sense that their eigenvalues are bounded well away from zero. These estimates are based on shrinkage of the eigenvalues and are therefore invariant under some orthogonal group, i.e. the shrinkage estimators shrink the eigenvalues but the eigenvectors remain unchanged. In other words, the basis (eigenvectors) in which the data are given is not taken advantage of, and therefore these methods rely on the premise that one will be able to find a good estimate in any basis. In particular, it is reasonable to believe that the basis generating the data is somewhat nice. Often this translates into the assumption that the covariance matrix has a particular structure that one should be able to take advantage of. In these situations, it becomes natural to perform some form of regularization directly on the entries of the sample covariance matrix.
Much of the recent literature focuses on two broad classes of regularized covariance matrix estimation: i) one class relies on a natural ordering among variables, where one often assumes that variables far apart are weakly correlated, and ii) the other class makes no assumption on a natural ordering among variables. The first class includes estimators based on banding and tapering [Bickel and Levina (2008), Cai et al. (2010)]. These estimators are appropriate for a number of applications with ordered data (time series, spectroscopy, climate data). However, for many applications including gene expression data, a priori knowledge of any canonical ordering is not available and searching over all permutations of possible orderings would not be feasible. In these situations, an $\ell_1$-penalized estimator becomes more appropriate, as it yields a permutation-invariant estimate.
To obtain a suitable estimate which is both well conditioned and sparse, we introduce two regularization terms: i) an $\ell_1$ penalty on each of the off-diagonal elements of the matrix and, ii) a squared deviation penalty on the eigenvalues from a suitable constant. The $\ell_1$ minimization problems are well studied in the covariance and inverse covariance matrix estimation literature [Friedman et al. (2007), Banerjee et al. (2008), Bickel and Levina (2008), Ravikumar et al. (2011), Bien and Tibshirani (2011), Maurya (2014), etc.]. Meinshausen and Bühlmann (2006) studied the problem of variable selection using high dimensional regression with the lasso and showed that it is a consistent selection scheme for high dimensional graphs. Rothman et al. (2008) propose an $\ell_1$-penalized log-likelihood estimator and show that their estimator is consistent in Frobenius and operator norm at the rate of $O_P\big(\sqrt{\{(p + s)\log p\}/n}\big)$, as both $p$ and $n$ approach infinity. Here $s$ is the number of non-zero off-diagonal elements in the true covariance matrix. Bien and Tibshirani (2011) propose an estimator of the covariance matrix as a penalized maximum likelihood estimator with a weighted lasso type penalty. In these optimization problems, the $\ell_1$ penalty results in a sparse and permutation-invariant estimate, as compared to other $\ell_q$, $q \neq 1$ penalties. Another advantage is that the $\ell_1$ norm is a convex function, which makes it suitable for large scale optimization problems, and a number of fast algorithms exist for covariance and inverse covariance matrix estimation [Friedman et al. (2007), Rothman (2012)]. The squared deviation penalty on the eigenvalues from a suitable constant overcomes the over-dispersion in the sample covariance matrix so that the estimator remains well conditioned.
Ledoit and Wolf (2004) proposed an estimator of the covariance matrix as a linear combination of the sample covariance and identity matrix. Their estimator of the covariance matrix is well conditioned but it is not sparse. Rothman (2012) proposed an estimator of the covariance matrix based on a squared error loss and an $\ell_1$ penalty with a log-barrier on the determinant of the covariance matrix. The log-determinant barrier is a valid technique to achieve positive definiteness, but it is still unclear whether the iterative procedure proposed in that paper [Rothman (2012)] actually finds the right solution to the corresponding optimization problem. In another interesting paper, Xue et al. (2012) propose an estimator of the covariance matrix as a minimizer of a penalized squared loss function over the set of positive definite cones. In that paper, the authors solve a positive definite constrained optimization problem and establish the consistency of the estimator. The resulting estimator is sparse and positive definite, but whether it overcomes the over-dispersion of the eigen-spectrum of the sample covariance matrix is hard to justify. Maurya (2014) proposed a joint convex penalty as a function of the $\ell_1$ norm and the trace norm (defined as the sum of singular values of a matrix) for inverse covariance matrix estimation based on a penalized likelihood approach.
In this paper, we derive an explicit rate of convergence of the proposed estimator (2.4) in Frobenius norm and operator norm. This rate depends upon the level of sparsity of the true covariance matrix. In addition, for a slight modification of the method (Theorem 3.3), we prove the consistency of our estimate in operator norm and show that its rate is similar to that of the banded estimator of Bickel and Levina (2008). One of the major advantages of the proposed estimator is that the derived algorithm is very fast, efficient and easily scalable to large scale data analysis problems.
The rest of the paper is organized as follows. The next section highlights some background and the problem set-up for covariance and inverse covariance matrix estimation. In Section 3, we give the proposed estimator and establish its theoretical consistency. In Section 4, we give an algorithm and compare its computational time with some other existing algorithms. Section 5 highlights the performance of the proposed estimator on simulated data, while an application of the proposed estimator to real life colon tumor data is given in Section 6.
Notation: For a matrix $M$, let $\|M\|_1$ denote its $\ell_1$ norm, defined as the sum of the absolute values of the entries of $M$; let $\|M\|_F$ denote the Frobenius norm of $M$, defined as the square root of the sum of squared elements of $M$; let $\|M\|$ denote the operator norm (also called the spectral norm), defined as the largest absolute eigenvalue of $M$; let $M^-$ denote the matrix $M$ with all diagonal elements set to zero; let $M^+$ denote the matrix $M$ with all off-diagonal elements set to zero; let $\sigma_i(M)$ denote the $i$th largest eigenvalue of $M$; let $\mathrm{tr}(M)$ denote its trace; and let $\det(M)$ denote its determinant.
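For concreteness, the quantities in this notation can be computed directly; the small R helpers below are added for illustration only (the function names are ours, not the paper's) and simply mirror the definitions above.

# Illustrative helpers matching the notation: ||M||_1, ||M||_F, operator norm, M^-.
l1_norm   <- function(M) sum(abs(M))                  # sum of absolute entries
frob_norm <- function(M) sqrt(sum(M^2))               # square root of sum of squared entries
op_norm   <- function(M) max(abs(eigen(M, symmetric = TRUE, only.values = TRUE)$values))
off_diag  <- function(M) { diag(M) <- 0; M }          # M^-: diagonal entries set to zero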
2. Background and Problem Set-up. Let $X = (X_1, X_2, \cdots, X_p)$ be a zero-mean $p$-dimensional random vector. The focus of this paper is the estimation of the covariance matrix $\Sigma := E(XX^T)$ and its inverse $\Sigma^{-1}$ from a sample of independently and identically distributed data $\{X^{(k)}\}_{k=1}^n$. In this section we describe the background and the problem set-up more precisely.
The choice of loss function is very crucial in any optimization problem. An optimal estimator for a particular loss function may not be optimal for another choice of loss function. The recent literature on covariance and inverse covariance matrix estimation mostly focuses on estimation based on the likelihood function or a quadratic loss function [Friedman et al. (2007), Banerjee et al. (2008), Bickel and Levina (2008), Ravikumar et al. (2011), Rothman (2012), Maurya (2014), etc.]. Maximum likelihood estimation requires a tractable probability distribution for the observations, whereas the quadratic loss function has no such requirement and is therefore fully non-parametric. The quadratic loss function is convex and, due to this analytical tractability, it is a widely applicable choice for many data analysis problems.


2.1. Proposed Estimator. Let $S$ be the sample covariance matrix. Consider the following optimization problem:
(2.1)
$$\hat{\Sigma}_{\lambda,\gamma} = \arg\min_{\Sigma=\Sigma^T} \Big[ \|\Sigma - S\|_F^2 + \lambda\|\Sigma^-\|_1 + \gamma\sum_{i=1}^{p} a_i\{\sigma_i(\Sigma) - t\}^2 \Big],$$
where $\sigma_i(\Sigma)$ is the $i$th largest eigenvalue of the matrix $\Sigma$, and $\lambda$ and $\gamma$ are positive constants. Note that through the penalty function $\|\Sigma^-\|_1$, we only penalize the off-diagonal elements of $\Sigma$. The constant $t \in \mathbb{R}^+$ is suitably chosen; a choice of $t$ can be the mean or median of the sample eigenvalues. The weights $a_i$ are shrinkage weights associated with the $i$th eigenvalue $\sigma_i$. For $a_i = 1$, $i = 1, 2, \cdots, p$, the optimization problem (2.1) shrinks all the eigenvalues by the same weight towards the same constant $t$ (the mean of the eigenvalues) and consequently (due to the squared deviation penalty on the eigenvalues) this yields maximum shrinkage in the eigen-spectrum. The squared deviation penalty for eigenvalue shrinkage is chosen for two reasons: i) it is easy to interpret, and ii) this choice of penalty function yields a very fast optimization algorithm. From here onwards we suppress the dependence of $\hat{\Sigma}_{\lambda,\gamma}$ on $\lambda$ and $\gamma$ and denote $\hat{\Sigma}_{\lambda,\gamma}$ by $\hat{\Sigma}$.
For $\gamma = 0$, the minimizer of (2.1) is the standard lasso estimator for the quadratic loss function, and its solution is (see Section 4 for the derivation of this estimator):
(2.2)
$$\hat{\Sigma}_{ii} = s_{ii}, \qquad \hat{\Sigma}_{ij} = \mathrm{sign}(s_{ij})\,\max\Big(|s_{ij}| - \frac{\lambda}{2},\, 0\Big), \quad i \neq j,$$
where $\mathrm{sign}(x)$ is the sign of $x$ and $|x|$ is the absolute value of $x$. It is clear from this expression that a sufficiently large value of $\lambda$ will result in a sparse covariance matrix estimate. But it is hard to assess whether $\hat{\Sigma}$ of (2.2) overcomes the over-dispersion in the sample eigenvalues. The eigenvalue plot in figure 2.1 illustrates this phenomenon for a neighborhood type covariance matrix (see Section 5 for a description of the neighborhood type of matrix). We simulated random vectors from a multivariate normal distribution with $n = 50$, $p = 50$.
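For illustration, the closed form (2.2) amounts to elementwise soft-thresholding of the off-diagonal entries of $S$ at level $\lambda/2$. A minimal R sketch of this rule follows (ours, not the paper's released code; the tuning value `lam` in the example is arbitrary).

# Soft-thresholding estimate (2.2): the gamma = 0 case of the JPEN objective.
soft_threshold_cov <- function(S, lam) {
  Sig <- sign(S) * pmax(abs(S) - lam / 2, 0)  # soft-threshold every entry at lam/2
  diag(Sig) <- diag(S)                        # diagonal entries are left at s_ii
  Sig
}
# Example: X <- matrix(rnorm(50 * 50), 50, 50); Sig_hat <- soft_threshold_cov(cov(X), 0.2)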

Fig 2.1. Comparison of eigenvalues of the sample and JPEN estimates of the covariance matrix.

As is evident from figure 2.1, the eigenvalues of the sample covariance matrix are over-dispersed, as most of them are either too large or close to zero. The eigenvalues of the Joint PENalty (JPEN) estimate (2.4) of the covariance matrix are consistent for the eigenvalues of the true covariance matrix; see Section 5 for a detailed discussion. Another drawback of the estimator (2.2) is that the estimate can be negative definite [for details see Xue et al. (2012)].
As argued earlier, to overcome the over-dispersion in the sample covariance matrix, we include the squared deviation penalty on the eigenvalues. To illustrate its advantage, consider $\lambda = 0$. After some algebra, the minimizer $\hat{\Sigma}$ of (2.1) (for $\lambda = 0$) is given by:
(2.3)
$$\hat{\Sigma} = \frac{1}{2}\big(\hat{\Sigma}_1 + \hat{\Sigma}_1^T\big), \quad \text{where} \quad \hat{\Sigma}_1 = (S + \gamma t\, UAU^T)(I + \gamma UAU^T)^{-1},$$
where $A = \mathrm{diag}(A_{11}, A_{22}, \cdots, A_{pp})$ with $A_{ii} = a_i$, and $U$ is a matrix of eigenvectors (refer to Section 4 for details on the choice of $U$). Note that $\hat{\Sigma}_1$ in (2.3) may not be symmetric but $\hat{\Sigma}$ is. To see whether the estimate above is positive definite, since $\sigma_{\min}(\hat{\Sigma}_1) = \sigma_{\min}(\hat{\Sigma}_1^T)$, after some algebra we have:
$$\begin{aligned}
\sigma_{\min}(\hat{\Sigma}) &= \sigma_{\min}\big(SU(I + \gamma A)^{-1}U^T + \gamma t\, UA(I + \gamma A)^{-1}U^T\big) \\
&\ge \sigma_{\min}\big(SU(I + \gamma A)^{-1}U^T\big) + \gamma t\, \sigma_{\min}\big(UA(I + \gamma A)^{-1}U^T\big) \\
&\ge \frac{\sigma_{\min}(S)}{1 + \gamma \max_{i \le p}(A_{ii})} + \gamma t \min_{i \le p}\Big(\frac{A_{ii}}{1 + \gamma A_{ii}}\Big) \\
&\ge \gamma t \min_{i \le p}\frac{A_{ii}}{1 + \gamma A_{ii}} > 0
\end{aligned}$$
for $\min_{i \le p} A_{ii} > 0$. This means that the squared deviation penalty on the eigenvalues improves $S$ to a positive definite estimator $\hat{\Sigma}$ provided that $\gamma > 0$, $t > 0$ and $\min_{i \le p} A_{ii} > 0$. Note that the estimator (2.3) is well conditioned but need not be sparse. Sparsity can be achieved by imposing an $\ell_1$ penalty on each entry of the covariance matrix. Simulation experiments have shown that in general the minimizer of (2.1) is not positive definite for all values of $\lambda > 0$ and $\gamma > 0$. To achieve a well conditioned and sparse positive definite estimator, we optimize the objective function in (2.1) over a specific region of values of $(\lambda, \gamma)$ which depends upon $S$, $t$, and $A$. The proposed JPEN estimator of the covariance matrix is given by:
(2.4)
$$\hat{\Sigma} = \arg\min_{\Sigma=\Sigma^T \,|\, (\lambda,\gamma)\in\hat{R}_1^{S,t,A,\varepsilon}} \Big[\|\Sigma - S\|_F^2 + \lambda\|\Sigma^-\|_1 + \gamma\sum_{i=1}^p a_i\{\sigma_i(\Sigma) - t\}^2\Big],$$
where
$$\hat{R}_1^{S,t,A,\varepsilon} = \bigcup_{\varepsilon>0}\Big\{(\lambda,\gamma) : (\lambda,\gamma,\varepsilon) \in \hat{\mathcal{R}}_1^{S,t,A,\varepsilon}\Big\}$$
and
$$\hat{\mathcal{R}}_1^{S,t,A,\varepsilon} = \Big\{(\lambda,\gamma,\varepsilon) : \lambda,\gamma > 0,\ \lambda \asymp \gamma \asymp \sqrt{\tfrac{\log p}{n}},\ \frac{\sigma_{\min}(S)}{1+\gamma\max_{i\le p}A_{ii}} + \gamma t\min_{i\le p}\Big(\frac{A_{ii}}{1+\gamma A_{ii}}\Big) - \frac{\lambda}{2}\max_{i\le p}(1+\gamma A_{ii})^{-1} \ge \varepsilon\Big\}.$$
The minimization in (2.4) over $\Sigma$ is for fixed $(\lambda,\gamma) \in \hat{R}_1^{S,t,A,\varepsilon}$, where $\varepsilon$ is some positive constant. Note that such a choice of $(\lambda,\gamma)$ guarantees that the minimum eigenvalue of the estimate in (2.4) is at least $\varepsilon > 0$. Theorem 3.1 establishes that the set $\hat{R}_1^{S,t,A,\varepsilon}$ is asymptotically nonempty.
2.2. Our Contribution. The main contributions are the following:
i) The proposed estimator is both sparse and well conditioned simultaneously. This approach allows one to take advantage of any prior structure, if known, on the eigenvalues of the true covariance matrix.
ii) We establish theoretical consistency of the proposed estimator in both Frobenius and operator norm.
iii) The proposed algorithm is very fast, efficient and easily scalable to large scale optimization problems.
We carried out simulations to compare the performance of the proposed estimators of the covariance and inverse covariance matrix with some other existing estimators for a number of structured covariance and inverse covariance matrices, for varying sample sizes and dimensions. See Section 5 for further details.
3. Analysis of JPEN Method. Definition: A random vector $X$ is said to have a sub-gaussian distribution if for each $y \in \mathbb{R}^p \setminus \{0\}$ with $\|y\|_2 = 1$ and for $t \ge 0$, there exists $0 < \tau < \infty$ such that
(3.1)
$$\mathbb{P}\{|y^T(X - E(X))| > t\} \le e^{-t^2/2\tau}.$$

Theorem 3.1. Let $X := (X_1, X_2, \cdots, X_p)$ be a mean zero sub-gaussian random vector as defined in (3.1). Let $S = (1/n)XX^T$ be the sample covariance matrix and suppose $\frac{p}{n} \to \bar\gamma < 1$ as $n = n(p) \to \infty$. Let $\hat{R}_1^{S,t,A,\varepsilon}$ be as defined in (2.4). For $(\lambda,\gamma) \in \hat{R}_1^{S,t,A,\varepsilon}$, we have $\hat{R}_1^{S,t,A,\varepsilon}\,\triangle\,R_1^{\infty,\varepsilon} \to \emptyset$ in probability, where
$$R_1^{\infty,\varepsilon} = \bigcup_{\varepsilon>0}\big\{(\lambda,\gamma) : g(\bar\gamma) > \varepsilon\big\},$$
$g(\bar\gamma) > 0$ is the limit in probability of the smallest eigenvalue of $S$, $\triangle$ denotes the symmetric difference of two sets, and $\emptyset$ is the empty set.
Next we give theoretical results on the consistency of our proposed estimator (2.4) of the covariance matrix.
3.1. Covariance Matrix Estimation. We make the following assumptions about the true covariance matrix $\Sigma_0$.
A0. $X := (X_1, X_2, \cdots, X_p)$ is a mean zero vector with covariance matrix $\Sigma_0$ such that each $X_i/\sqrt{\Sigma_{0ii}}$ has a sub-gaussian distribution with parameter $\tau$ as defined in (3.1).
A1. With $E = \{(i,j) : \Sigma_{0ij} \neq 0,\ i \neq j\}$, the cardinality of $E$ is at most $s$ for some positive integer $s$.
A2. There exists a finite positive real number $\bar{k} > 0$ such that $1/\bar{k} \le \sigma_{\min}(\Sigma_0) \le \sigma_{\max}(\Sigma_0) \le \bar{k}$, where $\sigma_{\min}(\Sigma_0)$ and $\sigma_{\max}(\Sigma_0)$ are the minimum and maximum eigenvalues of the matrix $\Sigma_0$, respectively.
Assumption A2 guarantees that the true covariance matrix $\Sigma_0$ is well conditioned (i.e. all the eigenvalues are finite and positive). Well conditioned means that [Ledoit and Wolf (2004)] inverting the matrix does not explode the estimation error. Assumption A1 is more of a definition, which says that the number of non-zero off-diagonal elements is bounded by some positive integer. Theorem 3.2 below gives the rate of convergence of the proposed covariance matrix estimator (2.4) in Frobenius norm.
Theorem 3.2. Let $(\lambda,\gamma) \in \hat{R}_1^{S,t,A,\varepsilon}$ and let $\hat{\Sigma}$ be as defined in (2.4). Under Assumptions A0, A1, A2 and for $\sigma_{\min}(\Sigma_0) \le t \le \sigma_{\max}(\Sigma_0)$, we have:
(3.2)
$$\|\hat{\Sigma} - \Sigma_0\|_F = O_P\Big(\sqrt{\frac{(p + s)\log p}{n}}\Big).$$
Here the worst part of the rate of convergence comes from estimating the diagonal entries. For correlation matrix estimation, the rate can be improved to $O_P\big(\sqrt{s\log p/n}\big)$ (Corollary 3.2).
Let $\Sigma_0 = W\Gamma W$ be the variance-correlation decomposition of the true covariance matrix $\Sigma_0$, where $\Gamma$ is the true correlation matrix and $W$ is the diagonal matrix of true standard deviations. Let $\hat{K}$ be the solution to the following optimization problem:
(3.3)
$$\hat{K} = \arg\min_{K=K^T\,|\,(\lambda,\gamma)\in\hat{R}_{1a}^{\hat\Gamma,t,A,\varepsilon}} \Big[\|K - \hat\Gamma\|_F^2 + \lambda\|K^-\|_1 + \gamma\sum_{i=1}^p a_i\{\sigma_i(K) - t\}^2\Big],$$
where $\hat{R}_{1a}^{\hat\Gamma,t,A,\varepsilon}$ is given by:
(3.4)
$$\hat{R}_{1a}^{\hat\Gamma,t,A,\varepsilon} = \bigcup_{\varepsilon>0}\Big\{(\lambda,\gamma) : (\lambda,\gamma,\varepsilon)\in\hat{\mathcal{R}}_{1a}^{\hat\Gamma,t,A,\varepsilon}\Big\},$$
with
$$\hat{\mathcal{R}}_{1a}^{\hat\Gamma,t,A,\varepsilon} = \Big\{(\lambda,\gamma,\varepsilon) : \lambda,\gamma > 0,\ \lambda\asymp\gamma\asymp\sqrt{\tfrac{\log p}{n}},\ \frac{\sigma_{\min}(\hat\Gamma)}{1+\gamma\max_{i\le p}A_{ii}} + \gamma t\min_{i\le p}\Big(\frac{A_{ii}}{1+\gamma A_{ii}}\Big) - \frac{\lambda}{2}\max_{i\le p}(1+\gamma A_{ii})^{-1} \ge \varepsilon\Big\},$$
and $\hat\Gamma$ is the sample correlation matrix, the sample counterpart of $\Gamma$. Similar to Theorem 3.1, the following corollary establishes that the symmetric difference between $\hat{R}_{1a}^{\hat\Gamma,t,A,\varepsilon}$ and its asymptotic counterpart $R_{1a}^{\infty,\varepsilon}$ is empty as $n = n(p) \to \infty$.

Corollary 3.1. Let $X := (X_1, X_2, \cdots, X_p)$ be a mean zero random vector where each $X_i$, $i = 1, \cdots, p$, has a sub-gaussian distribution as defined in (3.1). Let $\hat\Gamma$ be the sample correlation matrix. Let $\frac{p}{n} \to \bar\gamma < 1$ as $n = n(p)\to\infty$, and let $\hat{R}_{1a}^{\hat\Gamma,t,A,\varepsilon}$ be as defined in (3.4). We have $\hat{R}_{1a}^{\hat\Gamma,t,A,\varepsilon}\,\triangle\,R_{1a}^{\infty,\varepsilon} \to \emptyset$ in probability, where
$$R_{1a}^{\infty,\varepsilon} = \bigcup_{\varepsilon>0}\big\{(\lambda,\gamma) : (1-\sqrt{\bar\gamma})^2 > \varepsilon\big\}.$$

We have the following rate of convergence for the correlation matrix estimate $\hat{K}$ of (3.3).

Corollary 3.2. Under Assumptions A0, A1, A2, for $\sigma_{\min}(\Gamma) \le t \le \sigma_{\max}(\Gamma)$ and for $(\lambda,\gamma)\in\hat{R}_{1a}^{\hat\Gamma,t,A,\varepsilon}$,
(3.5)
$$\|\hat{K} - \Gamma\|_F = O_P\Big(\sqrt{\frac{s\log p}{n}}\Big).$$

The improved rate is due to the fact that for a correlation matrix all the diagonal entries are one. Define $\hat\Sigma_c := \hat{W}\hat{K}\hat{W}$, where $\hat{W}$ is the diagonal matrix of estimates of the true standard deviations based on the observations. The following theorem gives the rate of convergence, in operator norm, of the covariance matrix estimator based on the correlation matrix.
Theorem 3.3. Under Assumptions A0, A1, A2 and for $(\lambda,\gamma)\in\hat{R}_{1a}^{\hat\Gamma,t,A,\varepsilon}$,
(3.6)
$$\|\hat\Sigma_c - \Sigma_0\| = O_P\Big(\sqrt{\frac{(s+1)\log p}{n}}\Big).$$
Note that $\|\hat\Sigma_c - \Sigma_0\|_F \le \sqrt{p}\,\|\hat\Sigma_c - \Sigma_0\|$. Therefore the rate of convergence in Frobenius norm of the covariance matrix estimator based on the correlation matrix is the same as that of the estimator defined in (2.4).
Remark: This rate of operator norm convergence is the same as the one obtained in Bickel and Levina (2008) for banded covariance matrices. Although the method of proof is very different, the similar rate of convergence in operator norm is due to a similar kind of tail inequality for the sample covariance matrix of Gaussian and sub-Gaussian random variables [Ravikumar et al. (2011)]. Rothman (2012) proposes an estimator of the covariance matrix based on a similar loss function, but the different choice of penalty function yields a very different estimate. This is also exhibited in the simulation analysis of Section 5. Moreover, our proposed estimator is applicable for estimating any non-negative definite covariance matrix, which is not the case for Rothman's (2012) estimator (since Rothman's estimator involves the logarithm of the determinant of the estimator as another penalty to keep all the eigenvalues of the estimated matrix away from zero).
3.2. Estimation of Inverse Covariance Matrix. Notation: We shall use $\Omega$ for the inverse covariance matrix.
Assumptions: We make the following assumptions about the true inverse covariance matrix $\Omega_0$. Let $\Omega_0 = \Sigma_0^{-1}$.
B0. The random vector $X := (X_1, X_2, \cdots, X_p)$ is a mean zero vector with covariance matrix $\Sigma_0$ such that each $X_i/\sqrt{\Sigma_{0ii}}$ has a sub-gaussian distribution with parameter $\tau$ as in (3.1).
B1. With $H = \{(i,j) : \Omega_{0ij} \neq 0,\ i \neq j\}$, the cardinality of $H$ is at most $s$ for some positive integer $s$.
B2. There exists $0 < \bar{k} < \infty$ large enough such that $(1/\bar{k}) \le \sigma_{\min}(\Omega_0) \le \sigma_{\max}(\Omega_0) \le \bar{k}$ and $\sigma_{\min}(S + \delta I) \ge 1/\bar{k}$ for $\delta \asymp \sqrt{\log p/n}$ and $S = (1/n)XX^T$.
Remark: In Assumption B2, we require the maximum eigenvalue of $S_\delta^{-1} := (S + \delta I)^{-1}$ to be bounded above by some positive constant, i.e. $\sigma_{\min}(S + \delta I)$ to be bounded away from zero. If $\lim_{n=n(p)\to\infty} p/n = \bar\gamma < 1$, then by a result of Bai and Yin (1993), $\lim_{n=n(p)\to\infty} \sigma_{\min}(S) = g(\bar\gamma) > 0$. Consequently $\sigma_{\min}(S + \delta I) \ge 1/\bar{k}$ for large enough $\bar{k}$. This condition is required in establishing the rate of convergence of the estimator (3.7) (see Theorem 3.5).
Define the JPEN estimator $\hat\Omega$ of the inverse covariance matrix $\Omega_0$ as the solution to the following optimization problem:
(3.7)
$$\hat\Omega = \arg\min_{\Omega=\Omega^T\,|\,(\lambda,\gamma)\in\hat{R}_2^{S_\delta,t,A,\varepsilon}} \Big[\|\Omega - S_\delta^{-1}\|_F^2 + \lambda\|\Omega^-\|_1 + \gamma\sum_{i=1}^p a_i\{\sigma_i(\Omega)-t\}^2\Big],$$
where
(3.8)
$$\hat{R}_2^{S_\delta,t,A,\varepsilon} = \bigcup_{\varepsilon>0}\Big\{(\lambda,\gamma) : (\lambda,\gamma,\varepsilon)\in\hat{\mathcal{R}}_2^{S_\delta,t,A,\varepsilon}\Big\},$$
with
$$\hat{\mathcal{R}}_2^{S_\delta,t,A,\varepsilon} = \Big\{(\lambda,\gamma,\varepsilon):\lambda,\gamma>0,\ \lambda\asymp\gamma\asymp\sqrt{\tfrac{\log p}{n}},\ \frac{\sigma_{\min}(S_\delta^{-1})}{1+\gamma\max_{i\le p}A_{ii}} + \gamma t\min_{i\le p}\Big(\frac{A_{ii}}{1+\gamma A_{ii}}\Big) - \frac{\lambda}{2}\max_{i\le p}(1+\gamma A_{ii})^{-1}\ge\varepsilon\Big\},$$
for $A = \mathrm{diag}(A_{11}, A_{22}, \cdots, A_{pp})$ with $A_{ii} = a_i$, the shrinkage weights in (3.7).
Remark: Note that $S_\delta$ is a positive definite matrix and is therefore invertible.

Theorem 3.4. Let $X := (X_1, X_2, \cdots, X_p)$ be a mean zero vector where each $X_i$, $i=1,\cdots,p$, has a sub-gaussian distribution as defined in (3.1). Let $S = (1/n)XX^T$ and $S_\delta = S + \delta I$ for $\delta \asymp \sqrt{\log p/n}$. Let $\frac{p}{n}\to\bar\gamma<1$ as $n = n(p)\to\infty$ and let $\hat{R}_2^{S_\delta,t,A,\varepsilon}$ be as defined in (3.8). We have $\hat{R}_2^{S_\delta,t,A,\varepsilon}\,\triangle\,R_2^{\infty,\varepsilon}\to\emptyset$ in probability, where
$$R_2^{\infty,\varepsilon} = \bigcup_{\varepsilon>0}\big\{(\lambda,\gamma) : g_1(\bar\gamma) > \varepsilon\big\},$$
$g_1(\bar\gamma) = \lim_{n=n(p)\to\infty}\sigma_{\min}(S_\delta^{-1})$ and $\emptyset$ is the empty set.
The following theorem gives the consistency of the inverse covariance matrix estimator (3.7) in Frobenius norm.

Theorem 3.5. Let $\hat\Omega$ be the minimizer defined in (3.7). Under Assumptions B0, B1, B2, for $(\lambda,\gamma)\in\hat{R}_2^{S_\delta,t,A,\varepsilon}$ and $\sigma_{\min}(\Omega_0)\le t\le\sigma_{\max}(\Omega_0)$, we have:
(3.9)
$$\|\hat\Omega - \Omega_0\|_F = O_P\Big(\sqrt{\frac{(p+s)\log p}{n}}\Big).$$
Note that the rate of convergence here is the same as for covariance matrix estimation. Let $\hat{L}$ be the solution to the following optimization problem:
(3.10)
$$\hat{L} = \arg\min_{L=L^T\,|\,(\lambda,\gamma)\in\hat{R}_{2a}^{\hat\Gamma,t,A,\varepsilon}} \Big[\|L - \hat\Gamma_\delta^{-1}\|_F^2 + \lambda\|L^-\|_1 + \gamma\sum_{i=1}^p a_i\{\sigma_i(L)-t\}^2\Big],$$
where $\hat\Gamma_\delta^{-1} := \hat{W}S_\delta^{-1}\hat{W}$ and
(3.11)
$$\hat{R}_{2a}^{\hat\Gamma,t,A,\varepsilon} = \bigcup_{\varepsilon>0}\Big\{(\lambda,\gamma):(\lambda,\gamma,\varepsilon)\in\hat{\mathcal{R}}_{2a}^{\hat\Gamma,t,A,\varepsilon}\Big\},$$
with
$$\hat{\mathcal{R}}_{2a}^{\hat\Gamma,t,A,\varepsilon} = \Big\{(\lambda,\gamma,\varepsilon):\lambda,\gamma>0,\ \lambda\asymp\gamma\asymp\sqrt{\tfrac{\log p}{n}},\ \frac{\sigma_{\min}(\hat\Gamma_\delta^{-1})}{1+\gamma\max_{i\le p}A_{ii}} + \gamma t\min_{i\le p}\Big(\frac{A_{ii}}{1+\gamma A_{ii}}\Big) - \frac{\lambda}{2}\max_{i\le p}(1+\gamma A_{ii})^{-1}\ge\varepsilon\Big\}.$$

Corollary 3.3. Let $X := (X_1, X_2, \cdots, X_p)$ be a mean zero vector where each $X_i$, $i = 1,\cdots, p$, has a sub-gaussian distribution as defined in (3.1). Let $\frac{p}{n}\to\bar\gamma < 1$ as $n = n(p)\to\infty$ and let $\hat{R}_{2a}^{\hat\Gamma,t,A,\varepsilon}$ be as defined in (3.11). For $(\lambda,\gamma)\in\hat{R}_{2a}^{\hat\Gamma,t,A,\varepsilon}$, we have $\hat{R}_{2a}^{\hat\Gamma,t,A,\varepsilon}\,\triangle\,R_{2a}^{\infty,\varepsilon}\to\emptyset$ in probability, where
$$R_{2a}^{\infty,\varepsilon} = \bigcup_{\varepsilon>0}\big\{(\lambda,\gamma) : g_2(\bar\gamma) > \varepsilon\big\},$$
$g_2(\bar\gamma)$ is the limit of the smallest eigenvalue of $\hat\Gamma_\delta^{-1}$ and $\emptyset$ is the empty set.
We have the following rate of convergence for the inverse correlation matrix estimator given in (3.10).
Corollary 3.4. Let $\hat{L}$ be the minimizer of (3.10). Under Assumptions B0, B1, B2 and for $(\lambda,\gamma)\in\hat{R}_{2a}^{\hat\Gamma,t,A,\varepsilon}$,
(3.12)
$$\|\hat{L} - \Gamma^{-1}\|_F = O_P\Big(\sqrt{\frac{s\log p}{n}}\Big).$$
This rate is the same as that of the correlation matrix estimator given in (3.3).
Define $\hat\Omega_c := \hat{W}^{-1}\hat{L}\hat{W}^{-1}$. We have the following result on the operator norm consistency of the inverse covariance matrix estimator based on the inverse correlation matrix.

Theorem 3.6. Under Assumptions B0, B1, B2 and for $(\lambda,\gamma)\in\hat{R}_{2a}^{\hat\Gamma,t,A,\varepsilon}$,
(3.13)
$$\|\hat\Omega_c - \Omega_0\| = O_P\Big(\sqrt{\frac{(s+1)\log p}{n}}\Big).$$
Since $\|\hat\Omega_c - \Omega_0\|_F \le \sqrt{p}\,\|\hat\Omega_c - \Omega_0\|$, the rate of convergence of the inverse covariance matrix estimator based on the inverse correlation matrix is the same as that of the covariance matrix estimator based on the correlation matrix.
4. An Algorithm.
4.1. Covariance Matrix Estimation. The optimization problem (2.4) can be written as:
(4.1)
$$\hat\Sigma = \arg\min_{\Sigma=\Sigma^T\,|\,(\lambda,\gamma)\in\hat{R}_1^{S,t,A,\varepsilon}} f(\Sigma), \quad\text{where}\quad f(\Sigma) = \|\Sigma - S\|_F^2 + \lambda\|\Sigma^-\|_1 + \gamma\sum_{i=1}^p a_i\{\sigma_i(\Sigma)-t\}^2.$$
A solution to (4.1) is given by:
(4.2)
$$\hat\Sigma_{ii} = M_{ii}, \qquad \hat\Sigma_{ij} = \mathrm{sign}(M_{ij})\,\max\Big(|M_{ij}| - \frac{\lambda}{2(1+\gamma\max_{i\le p}A_{ii})},\, 0\Big), \quad i \neq j,$$
where
$$M = \frac{1}{2}\big(M_1 + M_1^T\big) \quad\text{with}\quad M_1 = (S + \gamma t\,UAU^T)(I + \gamma UAU^T)^{-1},$$
$A = \mathrm{diag}(A_{11}, A_{22}, \cdots, A_{pp})$ with $A_{ii} = a_i$, and $(\lambda,\gamma) \in \hat{R}_1^{S,t,A,\varepsilon}$.
Choice of $U$: Note that $U$ is the matrix of eigenvectors of $\hat\Sigma$, which is unknown. One choice of $U$ is the matrix of eigenvectors from the eigenvalue decomposition of $S + \delta I$ for some $\delta > 0$, i.e. let $S + \delta I = U_1D_1U_1^T$ and take $U = U_1$.
Choice of $\lambda$ and $\gamma$: For a given value of $\gamma$, we can find a value of $\lambda$ satisfying
$$\lambda < 2\big(1+\gamma\min_{i\le p}A_{ii}\big)\,\frac{\sigma_{\min}(S)}{1+\gamma\max_{i\le p}A_{ii}} + 2\gamma t\min_{i\le p}A_{ii} - 2\varepsilon,$$
and such a choice of $(\lambda,\gamma) \in \hat{R}_1^{S,t,A,\varepsilon}$ guarantees that the minimum eigenvalue of the estimate (4.2) will be at least $\varepsilon > 0$.
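The closed form (4.2) can be coded in a few lines. The R sketch below is an added illustration under the choices described above ($U$ from the eigenvectors of $S + \delta I$, equal weights $a_i = 1$ by default); it is not the author's released implementation, and the inputs `lam`, `gam`, `t0` and `delta` are assumed to be supplied by the user, e.g. from the feasible region $\hat{R}_1^{S,t,A,\varepsilon}$.

# Illustrative implementation of the closed-form JPEN update (4.2).
jpen_cov <- function(S, lam, gam, t0, a = rep(1, nrow(S)), delta = 1e-3) {
  p  <- nrow(S)
  U  <- eigen(S + delta * diag(p), symmetric = TRUE)$vectors  # choice of U (Section 4.1)
  K  <- gam * U %*% diag(a) %*% t(U)                          # gamma * U A U^T
  M1 <- (S + t0 * K) %*% solve(diag(p) + K)                   # (S + gam*t*UAU^T)(I + gam*UAU^T)^{-1}
  M  <- (M1 + t(M1)) / 2                                      # symmetrize
  thr <- lam / (2 * (1 + gam * max(a)))                       # soft-threshold level in (4.2)
  Sig <- sign(M) * pmax(abs(M) - thr, 0)                      # threshold off-diagonal entries
  diag(Sig) <- diag(M)                                        # keep the diagonal of M
  Sig
}
# Example call with t0 set to the mean of the sample eigenvalues:
# ev <- eigen(S, symmetric = TRUE, only.values = TRUE)$values
# Sig_hat <- jpen_cov(S, lam = 0.2, gam = 0.2, t0 = mean(ev))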
4.2. Inverse Covariance Matrix Estimation. To get an expression for the inverse covariance matrix estimate, we replace $S$ by $S_\delta^{-1}$ in (4.2). Let $A$ be the weight matrix for the eigenvalues of the inverse covariance matrix in optimization problem (3.7); then an optimal solution to (3.7) is given by:
(4.3)
$$\hat\Omega_{ii} = M_{ii}, \qquad \hat\Omega_{ij} = \mathrm{sign}(M_{ij})\,\max\Big(|M_{ij}| - \frac{\lambda}{2(1+\gamma\max_{i\le p}A_{ii})},\, 0\Big), \quad i \neq j,$$
where $M = \frac{1}{2}(M_2 + M_2^T)$, $M_2 = (S_\delta^{-1} + \gamma t\,U_1AU_1^T)(I + \gamma U_1AU_1^T)^{-1}$, and $(\lambda,\gamma) \in \hat{R}_2^{S_\delta,t,A,\varepsilon}$. A choice of $U_1$ is the matrix of eigenvectors from the eigen-decomposition $S_\delta^{-1} = U_1D_1U_1^T$.
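In code, the inverse covariance update (4.3) reuses the same closed form applied to $S_\delta^{-1} = (S + \delta I)^{-1}$. A minimal sketch, building on the illustrative `jpen_cov` helper given in Section 4.1 above, is:

# Illustrative inverse covariance update (4.3): apply the (4.2) form to (S + delta I)^{-1}.
jpen_inv <- function(S, lam, gam, t0, delta, ...) {
  # delta is of order sqrt(log p / n) in the paper (Assumption B2)
  Sd_inv <- solve(S + delta * diag(nrow(S)))   # positive definite, hence invertible
  jpen_cov(Sd_inv, lam, gam, t0, ...)          # same soft-thresholded closed form
}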
4.2.1. Computational Time. We compare the computational timing of our algorithm with some other existing algorithms: glasso [12] (Friedman et al. (2008)) and PDSCE (Rothman (2012)). Note that the exact timing of these algorithms also depends upon the implementation, platform etc. (we did our computations in R on an AMD 2.8GHz processor). For each estimate, the optimal tuning parameter was obtained by minimizing the empirical loss function
(4.4)
$$\|\hat\Sigma - S_{\mathrm{robust}}\|_F,$$
where $\hat\Sigma$ is an estimate of the covariance matrix and $S_{\mathrm{robust}}$ is the sample covariance matrix based on 20000 sample observations (refer to Section 5 for a detailed discussion).
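A grid search over $(\lambda,\gamma)$ with the criterion (4.4) is straightforward. The R sketch below is illustrative only, using the `jpen_cov` helper sketched in Section 4.1 and arbitrary grids; it simply returns the pair with the smallest Frobenius distance to the reference matrix.

# Illustrative tuning-parameter selection by minimizing ||Sigma_hat - S_robust||_F as in (4.4).
select_tuning <- function(S, S_robust, t0, lam_grid, gam_grid) {
  best <- list(err = Inf)
  for (lam in lam_grid) for (gam in gam_grid) {
    Sig <- jpen_cov(S, lam, gam, t0)
    err <- sqrt(sum((Sig - S_robust)^2))        # Frobenius distance to the reference
    if (err < best$err) best <- list(err = err, lam = lam, gam = gam)
  }
  best
}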
Figure 4.1 illustrates the total computational time taken to estimate the covariance matrix by the glasso, PDSCE and JPEN algorithms for different values of $p$, for the Toeplitz type of covariance matrix, on a log-log scale (see Section 5 for the Toeplitz type of covariance matrix). Although the proposed method requires optimization over a grid of values of $(\lambda,\gamma) \in \hat{R}_1^{S,t,A,\varepsilon}$, our algorithm is very fast and easily scalable to large scale data analysis problems.

Fig 4.1. Timing comparison of JPEN, Graphical Lasso (Glasso) and PDSCE on a log-log scale (x-axis: number of covariates p; y-axis: time in seconds).

5. Simulation Results. We compare the performance of the proposed method with other existing methods on simulated data for four types of structured covariance and inverse covariance matrices.
(i) Hub Graph: The rows/columns of $\Sigma_0$ are partitioned into $J$ equally-sized disjoint groups $\{V_1 \cup V_2 \cup \cdots \cup V_J\} = \{1, 2, ..., p\}$, each group associated with a pivotal row $k$. Let $|V_1| = s$. We set $\Sigma_{0i,j} = \Sigma_{0j,i} = \rho$ for $i \in V_k$ and $\Sigma_{0i,j} = \Sigma_{0j,i} = 0$ otherwise. In our experiment, $J = [p/s]$, $k = 1, s+1, 2s+1, ...$, and we always take $\rho = 1/(s+1)$ with $J = 20$.
(ii) Neighborhood Graph: We first uniformly sample $(y_1, y_2, ..., y_n)$ from a unit square. We then set $\Sigma_{0i,j} = \Sigma_{0j,i} = \rho$ with probability $(\sqrt{2\pi})^{-1}\exp(-4\|y_i - y_j\|^2)$. The remaining entries of $\Sigma_0$ are set to zero. The number of nonzero off-diagonal elements of each row or column is restricted to be smaller than $[1/\rho]$, where $\rho$ is set to be 0.245.
(iii) Toeplitz Matrix: We set $\Sigma_{0i,j} = 2$ for $i = j$; $\Sigma_{0i,j} = |0.75|^{|i-j|}$ for $|i - j| = 1, 2$; and $\Sigma_{0i,j} = 0$ otherwise (a small generator sketch is given after this list).
(iv) Block Diagonal Matrix: In this setting $\Sigma_0$ is a block diagonal matrix with varying block size. For $p = 500$ the number of blocks is 4 and for $p = 1000$ the number of blocks is 6. Each block of the covariance matrix is taken to be a Toeplitz type matrix as in case (iii).
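As an example of how these structured matrices are generated, the following R sketch (an added illustration) builds the Toeplitz covariance of setting (iii); the other settings are constructed analogously.

# Toeplitz covariance of setting (iii): 2 on the diagonal, |0.75|^|i-j| for |i-j| <= 2, else 0.
toeplitz_cov <- function(p) {
  idx <- abs(outer(1:p, 1:p, "-"))   # matrix of |i - j|
  Sig <- 0.75^idx                    # |0.75|^{|i-j|}
  Sig[idx > 2] <- 0                  # zero beyond the second off-diagonal
  diag(Sig) <- 2                     # diagonal entries equal 2
  Sig
}
# One simulation replicate (requires the MASS package):
# X <- MASS::mvrnorm(n = 50, mu = rep(0, 500), Sigma = toeplitz_cov(500))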
We chose a similar structure of $\Omega_0$ for the simulations. For all these choices of covariance and inverse covariance matrices, we generate random vectors from a multivariate normal distribution with varying $n$ and $p$. We chose $n = 50, 100$ and $p = 500, 1000$. Here we report the results for $n = 50$ and $p = 500, 1000$; please refer to Section 8 for the detailed simulation analysis.
We compare the performance of the proposed covariance matrix estimator to graphical lasso, the PDSC estimate [Rothman (2012)] and the Ledoit-Wolf estimate of the covariance matrix. The JPEN estimate (4.2) of the covariance matrix was computed using R software (version 3.0.2). The graphical lasso estimate of the covariance matrix was computed using the R package glasso (http://statweb.stanford.edu/~tibs/glasso/). The Ledoit-Wolf estimate was obtained using code from http://www.econ.uzh.ch/faculty/wolf/publications.html#9. The PDSC estimate was obtained using the PDSCE package (http://cran.r-project.org/web/packages/PDSCE/index.html). For the inverse covariance matrix performance comparison we only include glasso and PDSCE. For each covariance and inverse covariance matrix estimate, we calculate the Average Relative Error (ARE) based on 50 iterations using the following formula:
$$\mathrm{ARE}(\Sigma, \hat\Sigma) = \big|\log(f(S,\hat\Sigma)) - \log(f(S,\Sigma))\big| \,/\, \big|\log(f(S,\Sigma))\big|,$$
where $f(S,\cdot)$ is the density of the multivariate normal distribution, $S$ is the sample covariance matrix, $\Sigma$ is the true covariance matrix and $\hat\Sigma$ is the estimate of $\Sigma$. Other choices of performance criteria include the Kullback-Leibler divergence used by Yuan and Lin (2007) and Bickel and Levina (2008). The optimal values of the tuning parameters $\lambda$ and $\gamma$ were obtained by minimizing the empirical loss function given in (4.4). Simulations show that the optimal choice of the tuning parameters $\lambda$ and $\gamma$ is the same as if we replace $S_{\mathrm{robust}}$ by the true covariance matrix $\Sigma$.

The average relative errors and their standard deviations are given in Table 5.1; the numbers in brackets are the standard error estimates of the relative error. Table 5.1 gives the average relative errors and standard errors of the covariance matrix estimates based on glasso, Ledoit-Wolf, PDSCE and JPEN for $n = 50$ and $p = 500, 1000$. The glasso estimate of the covariance matrix performs very poorly among all the methods. The Ledoit-Wolf estimate performs well but the estimate is generally not sparse; also, the eigenvalue estimates of the Ledoit-Wolf estimator are more heavily shrunk towards the center than the true eigenvalues. The JPEN estimator outperforms the other estimators for most values of $p$ for all four types of covariance matrices. The PDSCE estimates have low average relative error, close to that of JPEN; this could be due to the fact that PDSCE and JPEN use a quadratic loss function with a different penalty function. Table 5.2 reports the average relative errors and their standard deviations for inverse covariance matrix estimation. Here we do not include the Ledoit-Wolf estimator and only compare the glasso and PDSCE estimates with the proposed JPEN estimator. The JPEN estimate of the inverse covariance matrix outperforms the other methods for both $p = 500$ and $p = 1000$ for all four types of structured inverse covariance matrices. Figure 5.1 reports the zero recovery plot of the percentage of times each zero element of the covariance matrix was truly recovered, based on 50 realizations. The JPEN estimate recovers the true zeros about 90% of the time for the Hub and Neighborhood types of covariance matrix. Our proposed estimator also reflects the recovery of the true structure of the non-zero entries and of any pattern among the rows/columns of the covariance matrix.
Table 5.1
Covariance matrix estimation

              Hub type matrix             Neighborhood type matrix
              p=500         p=1000        p=500          p=1000
Ledoit-Wolf   2.13(0.103)   2.43(0.043)   1.36(0.054)    2.89(0.028)
Glasso        10.8(0.06)    14.7(0.052)   11.9(0.056)    14.3(0.03)
PDSCE         1.22(0.052)   2.23(0.051)   0.912(0.077)   1.85(0.028)
JPEN          1.74(0.051)   1.97(0.037)   0.828(0.052)   1.66(0.028)

              Block type matrix           Toeplitz type matrix
              p=500         p=1000        p=500          p=1000
Ledoit-Wolf   1.54(0.102)   2.96(0.0903)  1.967(0.041)   2.344(0.028)
Glasso        30.8(0.0725)  33.9(0.063)   12.741(0.051)  18.22(0.04)
PDSCE         1.62(0.118)   3.08(0.0906)  0.873(0.042)   1.82(0.028)
JPEN          1.01(0.101)   1.91(0.0909)  0.707(0.042)   1.816(0.028)

Table 5.2
Inverse covariance matrix estimation

              Hub type matrix             Neighborhood type matrix
              p=500         p=1000        p=500          p=1000
Glasso        13.4(0.057)   17.5(0.065)   12.694(0.03)   13.596(0.033)
PDSCE         1.12(0.046)   2.34(0.044)   0.958(0.04)    1.85(0.038)
JPEN          0.613(0.033)  0.282(0.028)  0.392(0.038)   0.525(0.036)

              Block type matrix           Toeplitz type matrix
              p=500         p=1000        p=500          p=1000
Glasso        12.7(0.0406)  13.6(0.0316)  19.4(0.037)    20.7(0.022)
PDSCE         1.02(0.0562)  1.9(0.038)    1.91(0.064)    3.7(0.037)
JPEN          0.372(0.0481) 0.579(0.0328) 0.664(0.068)   2.42(0.045)

To see the implication of the eigenvalue shrinkage penalty as compared to other methods, we plot (Figure 5.2) the eigenvalues of the estimated covariance matrix for $n = 20$, $p = 50$. The JPEN estimates of the eigen-spectrum are far better than those of the other methods, the closest being the PDSC estimates of the eigenvalues.

Fig 5.1. Heatmap of zeros identified in the covariance matrix out of 50 realizations. White color is 50/50 zeros identified, black color is 0/50 zeros identified.

Fig 5.2. Eigenvalue plot for n = 20, p = 50 based on 50 realizations.
6. Colon Tumor Classification Example. In this section, we evaluate the performance of our proposed covariance matrix estimator for Linear Discriminant Analysis (LDA) classification of tumors using the gene expression data from Alon et al. (1999). In this experiment, colon adenocarcinoma tissue samples were collected, 40 of which were tumor tissues and 22 non-tumor tissues. Tissue samples were analyzed using an Affymetrix oligonucleotide array. The data were processed, filtered, and reduced to a subset of 2,000 gene expression values with the largest minimal intensity over the 62 tissue samples (source: http://genomics-pubs.princeton.edu/oncology/affydata/index.html). Additional information about the dataset and its pre-processing can be found in Alon et al. (1999). In our analysis, we reduce the number of genes by selecting the $p$ most significant genes based on logistic regression. We obtain estimates of the inverse covariance matrix for $p = 50, 100, 200$ and then use LDA to classify these tissues as either tumorous or non-tumorous (normal). We classify each test observation $x$ to either class $k = 0$ or $k = 1$ using the LDA rule
(6.1)
$$\hat\delta_k(x) = \arg\max_k \Big\{ x^T\hat\Omega\hat\mu_k - \frac{1}{2}\hat\mu_k^T\hat\Omega\hat\mu_k + \log(\hat\pi_k) \Big\},$$
where $\hat\pi_k$ is the proportion of class $k$ observations in the training data, $\hat\mu_k$ is the sample mean for class $k$ on the training data, and $\hat\Omega := \hat\Sigma^{-1}$ is an estimator of the inverse of the common covariance matrix on the training data, computed by one of the methods under consideration. The tuning parameters $\lambda$ and $\gamma$ were chosen using 5-fold cross validation. To create training and test sets, we randomly split the data into a training set of size 42 and a testing set of size 20; following the approach used by Wang et al. (2007), we require the training set to have 27 tumor samples and 15 non-tumor samples. We repeat the split at random 100 times and measure the average classification error.
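For clarity, the rule (6.1) can be applied as in the R sketch below; this is an added illustration (object names such as `Omega_hat`, `mu0`, `mu1` are hypothetical), and the inverse covariance estimate may come from any of the methods under comparison.

# Illustrative LDA rule (6.1): assign x to the class with the larger discriminant score.
lda_predict <- function(x, Omega_hat, mu0, mu1, pi0, pi1) {
  score <- function(mu, pi_k)
    drop(t(x) %*% Omega_hat %*% mu - 0.5 * t(mu) %*% Omega_hat %*% mu + log(pi_k))
  if (score(mu1, pi1) > score(mu0, pi0)) 1 else 0
}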
Table 6.1
Averages and standard errors of classification errors over 100 replications, in %.

Method                p=50          p=100         p=200
Logistic Regression   21.0(0.84)    19.31(0.89)   21.5(0.85)
SVM                   16.70(0.85)   16.76(0.97)   18.18(0.96)
Naive Bayes           13.3(0.75)    14.33(0.85)   14.63(0.75)
Graphical Lasso       10.9(1.3)     9.4(0.89)     9.8(0.90)
Joint Penalty         9.9(0.98)     8.9(0.93)     8.2(0.81)

Since we do not have separate validation set, we do the 5-fold cross validation on training data. At each split, we divide the training data into 5
subsets (fold) where 4 subsets are used to estimate the covariance matrix
and 1 subset is used to measure the classiers performance. For each split,
this procedure is repeated 5 times by taking one of the 5 subsets as validation data. An optimal combination of and is obtained by minimizing
the average classication error. Tuning parameter for graphical lasso was
obtained by similar criteria.
The average classication errors with standard errors over the 100 splits are
presented in Table 6.1. Since the sample size is less than the number of genes,
we omit the inverse sample covariance matrix as its not well dened and instead include the naive Bayes and support vector machine classiers. Naive
Bayes has been shown to perform better than the sample covariance matrix
in high-dimensional settings (Bickel and Levina (2004)). Support Vector
Machine(SVM) is another popular choice for high dimensional classication
tool (Chih-Wei Hsu et al. (2010)). Among all the methods covariance matrix
based based LDA classiers perform far better that Naive Bayes, SVM and
Logistic Regression. For all other classiers the classication performance
deteriorates for increasing p. For larger p i.e. when more genes are added to
the data set, the classication performance of JPEN estimate based LDA
classier improves which is dierent from Rothman et el. (2008) analysis of
same data set where the authors pointed out that as more genes are added
to the data set, the classiers performance deteriorates. Note that the classication error of a covariance matrix based classier initially decreases for
increasing p and deteriorates for large p. This is due to the fact that as dimension of covariance matrix increases, the estimator does not remain very

22

ASHWINI MAURYA

informative. In particular for p = 2000, when all the genes are used in data
analysis, the classication error of JPEN and glasso is about 30% which is
much higher than for p = 50.
7. Summary. We have proposed and analyzed regularized estimation of large covariance and inverse covariance matrices using a joint penalty. One of its biggest advantages is that the optimization carries little computational burden compared to many other methods for covariance regularization, and the resulting algorithm is very fast, efficient and easily scalable to large scale data analysis problems. We show that our estimators of the covariance and inverse covariance matrix are consistent in Frobenius and operator norm. The operator norm consistency guarantees consistency for principal components, hence we expect that PCA will be one of the most important applications of the method. Although the estimators in (2.4) and (3.7) do not require any assumption on the structure of the true covariance and inverse covariance matrices respectively, a priori knowledge of any structure of the true covariance matrix might be helpful in choosing a suitable weight matrix and hence in improving the estimation.

Acknowledgments. I would like to express my deep gratitude to Professor Hira L. Koul for his valuable and constructive suggestions during the planning and development of this research work.
References.
[1] Alon U., Barkai N., Notterman D., Gish K., Ybarra S., Mack D. and Levine A., Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proceedings of the National Academy of Sciences USA, 96(12), 6745-6750, 1999.
[2] Banerjee O., El Ghaoui L. and d'Aspremont A., Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. Journal of Machine Learning Research, 9, 485-516, 2008.
[3] Bickel P. and Levina E., Regularized estimation of large covariance matrices. The Annals of Statistics, 36, 199-227, 2008.
[4] Bickel P. and Levina E., Covariance regularization by thresholding. The Annals of Statistics, 36, 2577-2604, 2008.
[5] Cai T., Zhang C. and Zhou H., Optimal rates of convergence for covariance matrix estimation. The Annals of Statistics, 38, 2118-2144, 2010.
[6] Cai T., Liu W. and Luo X., A constrained l1 minimization approach to sparse precision matrix estimation. Journal of the American Statistical Association, 106, 594-607, 2011.
[7] Chaudhuri S., Drton M. and Richardson T., Estimation of a covariance matrix with zeros. Biometrika, 94(1), 199-216, 2007.
[8] Clarke R., Ressom H., Wang A., Xuan J., Liu M., Gehan E. and Wang Y., The properties of high-dimensional data spaces: implications for exploring gene and protein expression data. Nature Reviews Cancer, 8(1), 37-49, 2008.
[9] Dempster A., Covariance selection. Biometrika, 32, 95-108, 1972.
[10] Dey D. and Srinivasan C., Estimation of a covariance matrix under Stein's loss. The Annals of Statistics, 13(4), 1581-1591, 1985.
[11] Fan J., Fan Y. and Lv J., High-dimensional covariance matrix estimation using a factor model. Journal of Econometrics.
[12] Friedman J., Hastie T. and Tibshirani R., Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3), 432-441, 2008.
[13] Geman S., A limit theorem for the norm of random matrices. The Annals of Probability, 8(2), 252-261, 1980.
[14] Bien J. and Tibshirani R., Sparse estimation of a covariance matrix. Biometrika, 98(4), 807-820, 2011.
[15] Johnstone I. and Lu Y., Sparse principal components analysis. Unpublished manuscript, 2004.
[16] El Karoui N., Spectrum estimation for large dimensional covariance matrices using random matrix theory. The Annals of Statistics, 36(6), 2757-2790, 2008.
[17] El Karoui N., Operator norm consistent estimation of large dimensional sparse covariance matrices. The Annals of Statistics, 36, 2717-2756, 2008.
[18] Ledoit O. and Wolf M., A well-conditioned estimator for large-dimensional covariance matrices. Journal of Multivariate Analysis, 88, 365-411, 2004.
[19] Marcenko V. and Pastur L., Distributions of eigenvalues of some sets of random matrices. Math. USSR-Sbornik, 1, 507-536, 1967.
[20] Mardia K., Kent J. and Bibby J., Multivariate Analysis. Academic Press, New York. MR0560319, 1979.
[21] Maurya A., A joint convex penalty for inverse covariance matrix estimation. Computational Statistics and Data Analysis, 75, 15-27, 2014.
[22] Maurya A., A supplement to "A well conditioned and sparse estimate of covariance and inverse covariance matrix using a joint penalty". Submitted to the Annals of Statistics, Nov. 2014.
[23] Meinshausen N. and Bühlmann P., High dimensional graphs and variable selection with the lasso. The Annals of Statistics, 34, 1436-1462, 2006.
[24] Pass G., Chowdhury A. and Torgeson C., A picture of search. The First International Conference on Scalable Information Systems, Hong Kong, June 2006.
[25] Pourahmadi M., Modeling covariance matrices: the GLM and regularization perspectives. Statistical Science, 26, 369-387, 2011.
[26] Pourahmadi M., Cholesky decompositions and estimation of a covariance matrix: orthogonality of variance-correlation parameters. Biometrika, 94(4), 1006-1013, 2007.
[27] Ravikumar P., Wainwright M., Raskutti G. and Yu B., High-dimensional covariance estimation by minimizing l1-penalized log-determinant divergence. Electronic Journal of Statistics, 5, 935-980, 2011.
[28] Rothman A. J., Bickel P. J., Levina E. and Zhu J., Sparse permutation invariant covariance estimation. Electronic Journal of Statistics, 2, 494-515, 2008.
[29] Rothman A., Positive definite estimators of large covariance matrices. Biometrika, 99(3), 733-740, 2012.
[30] Wainwright M., Ravikumar P. and Lafferty J., High-dimensional graphical model selection using L1-regularized logistic regression. Proceedings of Advances in Neural Information Processing Systems, 2006.
[31] Stein C., Estimation of a covariance matrix. Rietz Lecture, 39th Annual Meeting IMS, Atlanta, Georgia, 1975.
[32] Wang S., Kuo T. and Hsu C., Trace bounds on the solution of the algebraic matrix Riccati and Lyapunov equation. IEEE Transactions on Automatic Control, AC-31(7), July 1986.
[33] Wang L., Zhu J. and Zou H., Hybrid huberized support vector machines for microarray classification. In ICML '07: Proceedings of the 24th International Conference on Machine Learning, pages 983-990, New York, NY, USA. ACM Press, 2007.
[34] Xue L., Ma S. and Zou H., Positive-definite l1-penalized estimation of large covariance matrices. Journal of the American Statistical Association, Theory and Methods, 107(500), 2012.
[35] Yin Y. and Bai Z., Limit of the smallest eigenvalue of a large dimensional sample covariance matrix. The Annals of Probability, 21(3), 1275-1294, 1993.
[36] Yuan M. and Lin Y., Model selection and estimation in the Gaussian graphical model. Biometrika, 94(1), 19-35, 2007.
[37] Yuan M., Sparse inverse covariance matrix estimation via linear programming. Journal of Machine Learning Research, 11, 2261-2286, 2009.
[38] Zhou S., Rütimann P., Xu M. and Bühlmann P., High-dimensional covariance estimation based on Gaussian graphical models. Journal of Machine Learning Research, to appear, 2011.
[39] Zou H., Hastie T. and Tibshirani R., Sparse principal components analysis. Journal of Computational and Graphical Statistics, 15, 265-286. MR2252527, 2006.

8. Technical Proofs.
Proof of Theorem 3.1. Let $\Sigma = UDU^T$ be the eigenvalue decomposition of $\Sigma$, and let
$$\begin{aligned}
f_1(D) &= \|UDU^T - S\|_F^2 + \lambda\|(UDU^T)^-\|_1 + \gamma\sum_{1\le i\le p} a_i\{\sigma_i(\Sigma) - t\}^2 \\
&= \mathrm{tr}(D^2) - 2\,\mathrm{tr}(SUDU^T) + \mathrm{tr}(S^2) + \lambda\|(UDU^T)^-\|_1 + \gamma\{\mathrm{tr}(AD^2) - 2t\,\mathrm{tr}(AD) + t^2\,\mathrm{tr}(A)\} \\
&= \mathrm{tr}\{D^2(I + \gamma A)\} - 2\,\mathrm{tr}\{D(U^TSU + \gamma tA)\} + \mathrm{tr}(S^2) + \lambda\|(UDU^T)^-\|_1 + \gamma t^2\,\mathrm{tr}(A).
\end{aligned}$$
Note that this is quadratic in $D$ and, since $(I+\gamma A)$ is a positive definite matrix, $f_1(D)$ is convex. Differentiating with respect to $D$, we obtain
$$\frac{\partial f_1(D)}{\partial D} = 2D(I+\gamma A) - 2(U^TSU + \gamma tA) + \lambda\,U^T\mathrm{sign}(UDU^T)U,$$
and $\frac{\partial f_1(D)}{\partial D} = 0$ is satisfied by
$$\hat{D} = (U^TSU + \gamma tA)(I+\gamma A)^{-1} - \frac{\lambda}{2}\,U^T\mathrm{sign}(U\hat{D}U^T)U\,(I+\gamma A)^{-1}.$$
Positive definiteness of the eigenvalue matrix $\hat{D}$ implies positive definiteness of $\hat\Sigma$. Next we derive a lower bound on the smallest eigenvalue of $\hat{D}$. Note that
$$\sigma_{\max}\big\{U^T\mathrm{sign}(U\hat{D}U^T)U(I+\gamma A)^{-1}\big\} \le \sigma_{\max}\big\{(I+\gamma A)^{-1}\big\} = \frac{1}{1+\gamma\min_{i\le p}A_{ii}}.$$
Hence we obtain
$$\sigma_{\min}(\hat{D}) \ge \sigma_{\min}\big\{U^TSU(I+\gamma A)^{-1}\big\} + \gamma t\,\sigma_{\min}\big\{A(I+\gamma A)^{-1}\big\} - \frac{\lambda}{2}\,\frac{1}{1+\gamma\min_{i\le p}A_{ii}} \ge \frac{\sigma_{\min}(S)}{1+\gamma\max_{i\le p}A_{ii}} + \gamma t\min_{i\le p}\frac{A_{ii}}{1+\gamma A_{ii}} - \frac{\lambda}{2}\,\frac{1}{1+\gamma\min_{i\le p}A_{ii}}.$$
For $\lambda \asymp \sqrt{\log p/n}$ and $\gamma \asymp \sqrt{\log p/n}$, we have $\sigma_{\min}(S) \to g(\bar\gamma) > 0$ in probability by a result of Bai and Yin (1993) [35]. Next we prove that $\hat{R}_1^{S,t,A,\varepsilon}\,\triangle\,R_1^{\infty,\varepsilon} \to \emptyset$ in probability. Define
$$Y_{\lambda,\gamma,t,A} = \frac{\sigma_{\min}(S)}{1+\gamma\max_{i\le p}A_{ii}} + \gamma t\min_{i\le p}\frac{A_{ii}}{1+\gamma A_{ii}} - \frac{\lambda}{2}\,\frac{1}{1+\gamma\min_{i\le p}A_{ii}}.$$
Since $\sigma_{\min}(S) \to g(\bar\gamma)$ in probability, for a given $\epsilon_1 > 0$ there exists a positive integer $N_1$ such that for all $n = n(p) \ge N_1$,
$$\mathbb{P}\big(|Y_{\lambda,\gamma,t,A} - g(\bar\gamma)| < \epsilon_1\big) \ge 1 - \epsilon_1,$$
i.e. $g(\bar\gamma) - \epsilon_1 \le Y_{\lambda,\gamma,t,A} \le g(\bar\gamma) + \epsilon_1$ with high probability. Letting $\epsilon_1 \to 0$, we have $\hat{R}_1^{S,t,A,\varepsilon}\,\triangle\,R_1^{\infty,\varepsilon} \to \emptyset$. Hence the theorem.
Remark: Note that the above result holds in an asymptotic sense under the assumptions of Theorem 3.1. For finite samples with $n < p$, $\sigma_{\min}(S) = 0$ and, because $\min_{i\le p}A_{ii} > 0$,
$$\sigma_{\min}(\hat{D}) \ge \gamma t\min_{i\le p}\frac{A_{ii}}{1+\gamma A_{ii}} - \frac{\lambda}{2}\,\frac{1}{1+\gamma\min_{i\le p}A_{ii}} = \frac{1}{1+\gamma\min_{i\le p}A_{ii}}\Big\{\gamma t\min_{i\le p}A_{ii} - \frac{\lambda}{2}\Big\} > 0$$
for sufficiently large $\gamma t$. This guarantees the existence of a nonempty set $\hat{R}_1^{S,t,A,\varepsilon}$ for finite samples.
Proof of Theorem 3.2. Let
$$f(\Sigma) = \|\Sigma - S\|_F^2 + \lambda\|\Sigma^-\|_1 + \gamma\sum_{i=1}^p a_i\{\sigma_i(\Sigma)-t\}^2,$$
where $\Sigma^-$ is the matrix $\Sigma$ with all diagonal elements set to zero. Define the function $Q(\cdot)$ by
$$Q(\Sigma) = f(\Sigma) - f(\Sigma_0),$$
where $\Sigma_0$ is the true covariance matrix and $\Sigma$ is any other covariance matrix. Let $\Sigma = UDU^T$ be the eigenvalue decomposition of $\Sigma$, where $D$ is the diagonal matrix of eigenvalues and $U$ is the matrix of eigenvectors. We have
(8.1)
$$Q(\Sigma) = \|\Sigma - S\|_F^2 + \lambda\|\Sigma^-\|_1 + \gamma\,\mathrm{tr}(AD^2 - 2tAD + t^2A) - \|\Sigma_0 - S\|_F^2 - \lambda\|\Sigma_0^-\|_1 - \gamma\,\mathrm{tr}(AD_0^2 - 2tAD_0 + t^2A),$$
where $A = \mathrm{diag}(a_1, a_2, \cdots, a_p)$ and $\Sigma_0 = U_0D_0U_0^T$ is the eigenvalue decomposition of $\Sigma_0$. Let $\Theta_n(M) := \{\Delta : \Delta = \Delta^T,\ \|\Delta\|_F = Mr_n,\ 0 < M < \infty\}$. The estimate $\hat\Sigma$ minimizes $Q(\Sigma)$, or equivalently $\hat\Delta = \hat\Sigma - \Sigma_0$ minimizes $G(\Delta) = Q(\Sigma_0 + \Delta)$. Note that $G(\Delta)$ is convex and, if $\hat\Delta$ is its solution, then $G(\hat\Delta) \le G(0) = 0$. Therefore, if we can show that $G(\Delta)$ is non-negative for $\Delta \in \Theta_n(M)$, this will imply that $\hat\Delta$ lies within the sphere of radius $Mr_n$. We require $r_n = \sqrt{\frac{(p+s)\log p}{n}} \to 0$ as $n = n(p)$ goes to $\infty$. This will give consistency of our estimate in Frobenius norm at rate $O_P(r_n)$. We have
$$\|\Sigma - S\|_F^2 - \|\Sigma_0 - S\|_F^2 = \mathrm{tr}(\Sigma\Sigma - \Sigma_0\Sigma_0) - 2\,\mathrm{tr}\{(\Sigma - \Sigma_0)S\} = \mathrm{tr}\{(\Sigma_0+\Delta)(\Sigma_0+\Delta) - \Sigma_0\Sigma_0\} - 2\,\mathrm{tr}(\Delta S) = \mathrm{tr}(\Delta\Delta) - 2\,\mathrm{tr}\{\Delta(S - \Sigma_0)\}.$$
Next, we bound the term involving $S$ in the above expression:
$$|\mathrm{tr}\{\Delta(\Sigma_0 - S)\}| \le \sum_{i\neq j}|\Delta_{ij}(\Sigma_{0ij} - S_{ij})| + \sum_{i=1}^p|\Delta_{ii}(\Sigma_{0ii} - S_{ii})| \le \max_{i\neq j}|\Sigma_{0ij} - S_{ij}|\,\|\Delta^-\|_1 + \sqrt{p}\,\max_i|\Sigma_{0ii} - S_{ii}|\,\sqrt{\textstyle\sum_{i=1}^p\Delta_{ii}^2} \le C_1\Big\{\sqrt{\tfrac{\log p}{n}}\,\|\Delta^-\|_1 + \sqrt{\tfrac{p\log p}{n}}\,\|\Delta^+\|_F\Big\},$$
which holds with high probability by a result (Lemma 1) from Ravikumar et al. (2011) on the tail inequality for the sample covariance matrix of sub-gaussian random vectors, where $C_1 = C_0(1+\tau)\max_i(\Sigma_{0ii})$ and $C_0 > 0$. Next we obtain an upper bound on the terms involving $\gamma$ in (8.1). We have
$$\mathrm{tr}(AD^2 - 2tAD) - \mathrm{tr}(AD_0^2 - 2tAD_0) = \mathrm{tr}\{A(U^T\Sigma^2U - U_0^T\Sigma_0^2U_0)\} - 2t\,\mathrm{tr}\{A(U^T\Sigma U - U_0^T\Sigma_0U_0)\},$$
and
(i) $\mathrm{tr}\{A(U^T\Sigma^2U - U_0^T\Sigma_0^2U_0)\} \le \sigma_1(A)\,\mathrm{tr}(\Sigma^2 - \Sigma_0^2) \le \mathrm{tr}\{(\Delta+\Sigma_0)^2 - \Sigma_0^2\} = \mathrm{tr}(2\Delta\Sigma_0 + \Delta\Delta) \le 2\bar{k}\sqrt{p}\,\|\Delta^+\|_F + \mathrm{tr}(\Delta\Delta)$;
(ii) $\mathrm{tr}\{A(U^T\Sigma U - U_0^T\Sigma_0U_0)\} \le \sigma_1(A)\,\mathrm{tr}(\Sigma - \Sigma_0) \le \mathrm{tr}\{(\Delta+\Sigma_0) - \Sigma_0\} = \mathrm{tr}(\Delta) \le \sqrt{p}\,\|\Delta^+\|_F$.
To bound the term $\lambda(\|\Sigma^-\|_1 - \|\Sigma_0^-\|_1)$ in (8.1), let $E$ be the index set defined in Assumption A1. Then, using the triangle inequality, we obtain
$$\lambda\big(\|\Sigma^-\|_1 - \|\Sigma_0^-\|_1\big) = \lambda\big(\|\Sigma_{0E}^- + \Delta_E^-\|_1 + \|\Delta_{E^c}^-\|_1 - \|\Sigma_{0E}^-\|_1\big) \ge \lambda\big(\|\Delta_{E^c}^-\|_1 - \|\Delta_E^-\|_1\big).$$
Let $\lambda = (C_1/\epsilon)\sqrt{\log p/n}$ and $\gamma = (C_1/\epsilon_1)\sqrt{\log p/n}$, where $(\lambda,\gamma) \in \hat{R}_1^{S,t,A,\varepsilon}$ and $(1/\bar{k}) \le t \le \bar{k}$. We obtain
$$\begin{aligned}
G(\Delta) &\ge \mathrm{tr}(\Delta\Delta) - 2C_1\Big\{\sqrt{\tfrac{\log p}{n}}\,\|\Delta^-\|_1 + \sqrt{\tfrac{p\log p}{n}}\,\|\Delta^+\|_F\Big\} - \frac{C_1}{\epsilon_1}\sqrt{\tfrac{\log p}{n}}\Big\{2\bar{k}\sqrt{p}\,\|\Delta\|_F + \|\Delta\|_F^2 + 2\sqrt{p}\,\|\Delta^+\|_F\Big\} + \frac{C_1}{\epsilon}\sqrt{\tfrac{\log p}{n}}\big(\|\Delta_{E^c}^-\|_1 - \|\Delta_E^-\|_1\big) \\
&\ge \|\Delta\|_F^2\Big(1 - \frac{C_1}{\epsilon_1}\sqrt{\tfrac{\log p}{n}}\Big) - 2C_1\sqrt{\tfrac{p\log p}{n}}\,\|\Delta^+\|_F - \frac{2C_1}{\epsilon_1}\sqrt{\tfrac{p\log p}{n}}\,(1+\bar{k})\,\|\Delta\|_F - 2C_1\sqrt{\tfrac{\log p}{n}}\,\|\Delta_E^-\|_1 + \frac{C_1}{\epsilon}\sqrt{\tfrac{\log p}{n}}\big(\|\Delta_{E^c}^-\|_1 - \|\Delta_E^-\|_1\big).
\end{aligned}$$
Also, because $\|\Delta_E^-\|_1 = \sum_{(i,j)\in E,\, i\neq j}|\Delta_{ij}| \le \sqrt{s}\,\|\Delta\|_F$, we have
$$2C_1\sqrt{\tfrac{\log p}{n}}\,\|\Delta_E^-\|_1 + \frac{C_1}{\epsilon}\sqrt{\tfrac{\log p}{n}}\,\|\Delta_E^-\|_1 \le \Big(2C_1 + \frac{C_1}{\epsilon}\Big)\sqrt{\tfrac{s\log p}{n}}\,\|\Delta\|_F$$
for sufficiently small $\epsilon$. Therefore,
$$\begin{aligned}
G(\Delta) &\ge \|\Delta\|_F^2\Big(1 - \frac{C_1}{\epsilon_1}\sqrt{\tfrac{\log p}{n}}\Big) - 2C_1\sqrt{\tfrac{(p+s)\log p}{n}}\,\|\Delta\|_F\Big(1 + \frac{1+\bar{k}}{\epsilon_1}\Big) \\
&\ge \|\Delta\|_F^2\Big[1 - \frac{C_1}{\epsilon_1}\sqrt{\tfrac{\log p}{n}} - \frac{2C_1}{M}\Big(1 + \frac{1+\bar{k}}{\epsilon_1}\Big)\Big] \ge 0,
\end{aligned}$$
for all sufficiently large $n$ and $M$. Hence the theorem.


Proof of Corollary 3.1. Note that for a correlation matrix, all the variables are standardized to have mean zero and variance 1. Using a result from Bai and Yin (1993), we have $\lim_{n=n(p)\to\infty}\sigma_{\min}(S) = (1-\sqrt{\bar\gamma})^2 > 0$ for $\bar\gamma < 1$. The rest of the proof of this corollary is similar to that of Theorem 3.1 and hence omitted.
Proof of Corollary 3.2. This corollary is a special case of Theorem 3.2 when all of the variables are standardized to have mean zero and variance 1.
Proof of Theorem 3.3. We have
$$\hat\Sigma_c - \Sigma_0 = \hat{W}\hat{K}\hat{W} - W\Gamma W = \hat{W}(\hat{K}-\Gamma)(\hat{W}-W) + \hat{W}(\hat{K}-\Gamma)W + \hat{W}\Gamma(\hat{W}-W) + (\hat{W}-W)\Gamma W.$$
Since $\|\Gamma\| = O(1)$, it follows from Corollary 3.2 that $\|\hat{K}\| = O_P(1)$. Also,
$$\|\hat{W}^2 - W^2\| = \max_{\|x\|_2=1}\Big|\sum_{i=1}^p(\hat{w}_i^2 - w_i^2)x_i^2\Big| \le \max_{1\le i\le p}|\hat{w}_i^2 - w_i^2|\,\max_{\|x\|_2=1}\sum_{i=1}^p x_i^2 = \max_{1\le i\le p}|\hat{w}_i^2 - w_i^2| = O_P\Big(\sqrt{\frac{\log p}{n}}\Big),$$
which holds with high probability by a result (Lemma 1) from Ravikumar et al. (2011) on the tail inequality for the entries of the sample covariance matrix of sub-gaussian random vectors. Next we show that $\|\hat{W} - W\| \asymp \|\hat{W}^2 - W^2\|$ (where $A \asymp B$ means $A = O_P(B)$ and $B = O_P(A)$). We have
$$\|\hat{W} - W\| = \max_{\|x\|_2=1}\sum_{i=1}^p|\hat{w}_i - w_i|\,x_i^2 = \max_{\|x\|_2=1}\sum_{i=1}^p\frac{|\hat{w}_i^2 - w_i^2|}{\hat{w}_i + w_i}\,x_i^2 \le C_3\max_{\|x\|_2=1}\sum_{i=1}^p|\hat{w}_i^2 - w_i^2|\,x_i^2 = C_3\,\|\hat{W}^2 - W^2\|,$$
where we have used the fact that the true standard deviations are bounded well away from zero, i.e. there exists $0 < C_3 < \infty$ such that $1/C_3 \le w_i^{-1} \le C_3$ for all $i = 1, 2, \cdots, p$, and that the sample standard deviations are all positive, i.e. $\hat{w}_i > 0$ for all $i = 1, 2, \cdots, p$. Now, since $\|\hat{W}^2 - W^2\| \asymp \|\hat{W} - W\|$, it follows that $\|\hat{W}\| = O_P(1)$. This implies that
$$\|\hat\Sigma_c - \Sigma_0\| = O_P\Big(\sqrt{\frac{\log p}{n}} + \sqrt{\frac{s\log p}{n}}\Big).$$
Hence Theorem 3.3 follows.
Proof of theorem 3.5. The method of proof for inverse covariance matrix is similar to covariance matrix estimation. We keep the notations similar
to that in proof of Theorem 3.2. Dene,
(8.2)

Q() = S 1 2 + 1 + tr(AD2 2tAD + t2 A)


2
2
2
0 S 1 2
0 1 tr(AD0 2tAD0 + t A)

where 0 is the true inverse covariance matrix and is any other covariance
matrix, A = diag(A11 , A22 , , App ), = U DU T and 0 = U0 D0 U0T be
eigenvalue decomposition of and 0 respectively where D and D0 are
diagonal matrices of eigenvalues and U and U0 are matrices of eigenvectors.
Let = 0 (dierence between any estimate and true inverse
covariance matrix 0 ). Dene the set of symmetric as: (M ) = { : =
minimizes the Q() or
T , F = M rn , 0 < M < }. The estimate

equivalently = 0 minimizes the G() = Q(0 + ) where G() is


is a solution to G(), then we have G()
G(0) = 0.
convex. Note that if
As argued in the Proof of Theorem 3.2, if we can show that G() is non lies within sphere of
negative for n (M ), this will imply that the

30

ASHWINI MAURYA

radius $Mr_n$. We require $r_n = \sqrt{\frac{(p+s)\log p}{n}} \to 0$ as $n \to \infty$; this gives consistency of our estimate in Frobenius norm at rate $O_P(r_n)$. On similar lines as in the proof of Theorem 3.2, for $(\lambda,\gamma) \in \hat{R}^{S^{-1}}_{1,t,A,\varepsilon}$ we obtain
$$
G(\Delta) \;\geq\; \mathrm{tr}(\Delta\Delta^T) - 2\,\mathrm{tr}\big(\Delta(S^{-1} - \Omega_0)\big)
+ \lambda\big(\|\Delta_{\bar{H}}\|_1 - \|\Delta_H\|_1\big)
- \gamma\,C_1\sqrt{\tfrac{\log p}{n}}\,\big\{2k\sqrt{p}\,\|\Delta\|_F + \|\Delta\|_F^2 + 2\sqrt{p}\,\|\Delta\|_F\big\},
$$
where $H$ is the index set defined in Assumption B1 and $\bar{H} = \{(i,j) : (i,j) \notin H,\ i,j = 1,2,\dots,p\}$. Also $\|\Delta_H\|_1 \leq \sqrt{s}\,\|\Delta\|_F$.


Consider the term involving $S^{-1}$:
$$
\begin{aligned}
|\mathrm{tr}(\Delta(\Omega_0 - S^{-1}))| \;&=\; |\mathrm{tr}(\Delta S^{-1}(S - \Omega_0^{-1})\Omega_0)| \\
&\leq\; \sigma_1(S^{-1})\,|\mathrm{tr}(\Delta(S - \Omega_0^{-1})\Omega_0)| \\
&\leq\; \sigma_1(S^{-1})\,|\mathrm{tr}(\Delta(S - \Omega_0^{-1}))|\,\sigma_1(\Omega_0) \\
&\leq\; k^2\,|\mathrm{tr}(\Delta(S - \Sigma_0))|,
\end{aligned}
$$
by using a result on the trace norm inequality from [31]. Now consider the term $\mathrm{tr}(\Delta(S - \Sigma_0))$:
$$
\mathrm{tr}(\Delta(S + \varepsilon I - \Sigma_0)) \;=\; \mathrm{tr}(\Delta(S - \Sigma_0)) + \varepsilon\,\mathrm{tr}(\Delta)
\;\leq\; C_1\Big(\sqrt{\tfrac{\log p}{n}}\,\|\Delta\|_1 + \sqrt{\tfrac{(p+s)\log p}{n}}\,\|\Delta\|_F\Big) + C_1\sqrt{\tfrac{p\log p}{n}}\,\|\Delta\|_F,
$$
which holds with high probability by using a result (Lemma 1) from Ravikumar et al. (2011) on the tail inequality for subgaussian random vectors, where
$\varepsilon \leq \sqrt{\log p/n}$ and $C_1$ is defined as in the proof of Theorem 3.2. We have
$$
\begin{aligned}
G(\Delta) \;&\geq\; \|\Delta\|_F^2\Big(1 - \varepsilon C_1\sqrt{\tfrac{\log p}{n}}\Big)
- 2k^2C_1\sqrt{\tfrac{(p+s)\log p}{n}}\,\|\Delta\|_F \\
&\qquad - \gamma\,C_1\sqrt{\tfrac{\log p}{n}}\Big(2\sqrt{p}\,(1+k)\,\|\Delta\|_F + \|\Delta\|_F^2 + k^2\sqrt{s}\,\|\Delta\|_F + 2\sqrt{p}\,\|\Delta\|_F\Big) \\
&\geq\; \|\Delta\|_F^2\Big[\,1 - \varepsilon C_1\sqrt{\tfrac{\log p}{n}}
- 2k^2C_1\,\frac{\sqrt{(p+s)\log p/n}}{\|\Delta\|_F}
- 2C_1(1+k^2)\,\frac{\sqrt{(p+s)\log p/n}}{\|\Delta\|_F}\,\Big] \\
&\geq\; \|\Delta\|_F^2\Big[\,1 - \varepsilon C_1\sqrt{\tfrac{\log p}{n}} - \frac{2k^2C_1 + 2C_1(1+k^2)}{M}\,\Big]
\;>\; 0
\end{aligned}
$$
for all sufficiently large $n$ and $M$. Hence the result.
Proof of Corollary 3.3. The proof of this Corollary is similar to Theorem 3.1 and hence omitted.
Proof of Corollary 3.4. The proof of this Corollary is similar to Corollary 3.2 and hence omitted.
Proof of Theorem 3.4. The proof of this Theorem is similar to Theorem 3.1 and hence omitted.
Proof of Theorem 3.6. The proof of this Theorem is similar to Theorem 3.3 and hence omitted.
8.1. Derivation of the Algorithm.
8.1.1. Covariance matrix estimation. The optimization problem (2.4)
can be written as:

32

ASHWINI MAURYA

$$
(8.3)\qquad
\hat{\Sigma} \;=\; \underset{\Sigma = \Sigma^T,\ (\lambda,\gamma)\in \hat{R}^{S}_{1,t,A,\varepsilon}}{\arg\min}\ f(\Sigma),
\qquad\text{where}\qquad
f(\Sigma) \;=\; \|\Sigma - S\|_F^2 + \lambda\|\Sigma^-\|_1 + \gamma\sum_{i=1}^p a_i\,\{\sigma_i(\Sigma) - t\}^2 .
$$
Note that for a non-negative definite square matrix, the singular values are the same as the eigenvalues, and we have the trace identity: the sum of the eigenvalues of $\Sigma$ equals $\mathrm{tr}(\Sigma)$.
Let $\Sigma = UDU^T$, where $D$ is the diagonal matrix of eigenvalues and $U$ is the orthogonal matrix of eigenvectors. We have $\sum_{i=1}^p a_i\sigma_i^2(\Sigma) = \sum_{i=1}^p a_iD_{ii}^2 = \mathrm{tr}(AD^2)$, where $A = \mathrm{diag}(a_1, a_2, \dots, a_p)$. Again $D = U^T\Sigma U$, so $D^2 = D^TD = U^T\Sigma^T\Sigma U = U^T\Sigma^2U$. Therefore
$$
\mathrm{tr}(AD) = \mathrm{tr}(\Sigma\,UAU^T)
\qquad\text{and}\qquad
\mathrm{tr}(AD^2) = \mathrm{tr}(AU^T\Sigma^2U) = \mathrm{tr}(\Sigma^2\,UAU^T).
$$
The third term on the right hand side of (8.3) can therefore be written as
$$
\gamma\sum_{i=1}^p a_i\{\sigma_i(\Sigma) - t\}^2
= \gamma\sum_{i=1}^p\big\{a_i\sigma_i^2(\Sigma) - 2t\,a_i\sigma_i(\Sigma) + a_it^2\big\}
= \gamma\Big(\mathrm{tr}(\Sigma^2UAU^T) - 2t\,\mathrm{tr}(\Sigma UAU^T) + t^2\sum_{i=1}^p a_i\Big).
$$

Therefore,
$$
\begin{aligned}
f(\Sigma) &= \|\Sigma - S\|_F^2 + \lambda\|\Sigma^-\|_1 + \gamma\,\mathrm{tr}(\Sigma^2UAU^T) - 2\gamma t\,\mathrm{tr}(\Sigma UAU^T) + \gamma t^2\,\mathrm{tr}(A) \\
&= \mathrm{tr}(\Sigma^T\Sigma) - 2\,\mathrm{tr}(\Sigma S) + \mathrm{tr}(S^TS) + \lambda\|\Sigma^-\|_1 + \gamma\,\mathrm{tr}(\Sigma^2UAU^T) - 2\gamma t\,\mathrm{tr}(\Sigma UAU^T) + \gamma t^2\,\mathrm{tr}(A) \\
&= \mathrm{tr}\big(\Sigma^2(I + \gamma\,UAU^T)\big) - 2\,\mathrm{tr}\big\{\Sigma(S + \gamma t\,UAU^T)\big\} + \mathrm{tr}(S^TS) + \lambda\|\Sigma^-\|_1 + \gamma t^2\,\mathrm{tr}(A) \\
&= \mathrm{tr}(\Sigma^2C) - 2\,\mathrm{tr}(\Sigma B) + \mathrm{tr}(S^TS) + \lambda\|\Sigma^-\|_1 + \gamma t^2\,\mathrm{tr}(A) \\
&= \mathrm{tr}\big\{(\Sigma^2 - 2\Sigma BC^{-1})\,C\big\} + \mathrm{tr}(S^TS) + \lambda\|\Sigma^-\|_1 + \gamma t^2\,\mathrm{tr}(A),
\end{aligned}
$$

where $I$ is the identity matrix, $C = I + \gamma\,UAU^T$ and $B = S + \gamma t\,UAU^T$. Note that $UAU^T = UA^{1/2}A^{1/2}U^T = (UA^{1/2})(UA^{1/2})^T$ is a positive definite matrix. Since $\gamma$ is non-negative, $C$ is the sum of two positive definite matrices and is therefore positive definite. Also $C^{-1} = U(I + \gamma A)^{-1}U^T$ and $\sigma_1(C) \leq 1 + \gamma\max_{i\leq p}A_{ii}$.
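As a quick numerical check of the trace identities used above, the following standalone sketch (with arbitrary dimensions, weights and seed, chosen only for illustration) verifies that $\sum_{i=1}^p a_i\{\sigma_i(\Sigma) - t\}^2 = \mathrm{tr}(\Sigma^2UAU^T) - 2t\,\mathrm{tr}(\Sigma UAU^T) + t^2\,\mathrm{tr}(A)$ when $U$ is the eigenvector matrix of $\Sigma$.

import numpy as np

rng = np.random.default_rng(0)
p = 5
X = rng.standard_normal((p, p))
Sigma = X @ X.T + p * np.eye(p)            # a random positive definite matrix

eigvals, U = np.linalg.eigh(Sigma)         # Sigma = U diag(eigvals) U^T
a = rng.uniform(0.5, 1.5, size=p)          # arbitrary positive weights a_i
A = np.diag(a)
t = eigvals.mean()                         # target for the eigenvalues

lhs = np.sum(a * (eigvals - t) ** 2)
rhs = (np.trace(Sigma @ Sigma @ U @ A @ U.T)
       - 2 * t * np.trace(Sigma @ U @ A @ U.T)
       + t ** 2 * np.trace(A))
print(lhs, rhs)                            # the two values agree up to floating point error
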
Consider the term involving only $\Sigma$:
$$
\begin{aligned}
f_1(\Sigma) &= \mathrm{tr}\big\{(\Sigma^2 - 2\Sigma BC^{-1})\,C\big\} + \lambda\|\Sigma^-\|_1 \\
&\leq \mathrm{tr}\big(\Sigma^2 - 2\Sigma BC^{-1}\big)\,\sigma_1(C) + \lambda\|\Sigma^-\|_1 \\
&= \|\Sigma - BC^{-1}\|_F^2\,\big(1 + \gamma\max_{i\leq p}A_{ii}\big) + \lambda\|\Sigma^-\|_1
\qquad\text{(up to an additive constant not involving $\Sigma$)} \\
&= \big(1 + \gamma\max_{i\leq p}A_{ii}\big)\Big(\|\Sigma - BC^{-1}\|_F^2 + \big\{\lambda/(1 + \gamma\max_{i\leq p}A_{ii})\big\}\,\|\Sigma^-\|_1\Big)
\;=\; f_2(\Sigma),
\end{aligned}
$$
where
$$
f_2(\Sigma) \;=\; \big(1 + \gamma\max_{i\leq p}A_{ii}\big)\Big(\|\Sigma - BC^{-1}\|_F^2 + \big\{\lambda/(1 + \gamma\max_{i\leq p}A_{ii})\big\}\,\|\Sigma^-\|_1\Big).
$$
The function $f_2(\Sigma)$ is convex in $\Sigma$ and therefore the minimizer of $f_2(\Sigma)$ is unique. Note that for arbitrary choices of $\lambda$ and $\gamma$, minimization of $f_2(\Sigma)$ can yield a non-positive definite estimator; however, as argued earlier, values of $(\lambda,\gamma) \in \hat{R}^{S}_{1,t,A,\varepsilon}$ will yield a sparse and well conditioned positive definite estimator. Clearly the minimum of $f_2(\Sigma)$ is attained when
$$
(8.4)\qquad \mathrm{sign}(\sigma_{ij}) = \mathrm{sign}(\sigma_{ji}) = \mathrm{sign}\big((BC^{-1})_{ij}\big).
$$
Differentiating the right side of $f_2(\Sigma)$ yields
$$
2\Sigma - 2BC^{-1} + \big\{\lambda/(1 + \gamma\max_{i\leq p}A_{ii})\big\}\,\mathrm{sign}(\Sigma) \;=\; 0 .
$$
Using the optimality condition (8.4), we have
$$
(8.5)\qquad
\hat{\sigma}_{ii} = (BC^{-1})_{ii},
\qquad
\hat{\sigma}_{ij} = (BC^{-1})_{ij} - \frac{\lambda\,\mathrm{sign}\big((BC^{-1})_{ij}\big)}{2\,\big(1 + \gamma\max_{i\leq p}A_{ii}\big)}
\quad\text{for } i\neq j .
$$
Note that the estimate $\hat{\Sigma}$ involves the matrix of eigenvectors $U$. Since for a given eigenvalue the eigenvectors are not unique, we can choose a suitable matrix of eigenvectors corresponding to some positive definite covariance matrix. One choice is $U = U_1$, where $S + \varepsilon I = U_1D_1U_1^T$ for some $\varepsilon > 0$. Next, to check whether the solution of $f_2(\Sigma)$ given by (8.5) is feasible, consider:

Case (i): $\hat{\sigma}_{ij} \geq 0$. The solution (8.5) satisfies the optimality condition (8.4) if and only if $(BC^{-1})_{ij} \geq \dfrac{\lambda}{2\,(1 + \gamma\max_{i\leq p}A_{ii})}$.
Case (ii): $\hat{\sigma}_{ij} < 0$. As in Case (i), the solution (8.5) satisfies the optimality condition (8.4) if and only if $(BC^{-1})_{ij} < -\dfrac{\lambda}{2\,(1 + \gamma\max_{i\leq p}A_{ii})}$.
Note that $BC^{-1}$ need not be symmetric. To obtain a symmetric estimate, we symmetrize it as
$$
M \;=\; \tfrac{1}{2}\big(BC^{-1} + (BC^{-1})^T\big).
$$

Combining these two cases, the optimal solution of (8.3) is given by
$$
(8.6)\qquad
\hat{\sigma}_{ii} = M_{ii},
\qquad
\hat{\sigma}_{ij} = \mathrm{sign}(M_{ij})\,\max\Big\{|M_{ij}| - \frac{\lambda}{2\,(1 + \gamma\max_{i\leq p}A_{ii})},\ 0\Big\}
\quad\text{for } i\neq j,
$$
where $\mathrm{sign}(x)$ is the sign of $x$ and $|x|$ is the absolute value of $x$.


Choice of U:
Note that $U$ is the matrix of eigenvectors of $\Sigma$, which is unknown. In practice, one can choose $U$ as the matrix of eigenvectors from the eigenvalue decomposition of $S + \varepsilon I$ for some $\varepsilon > 0$; that is, if $S + \varepsilon I = U_1D_1U_1^T$, take $U = U_1$.
Choice of $\lambda$ and $\gamma$:
For a given value of $\gamma$, we can find a value of $\lambda$ satisfying
$$
\lambda \;<\; \frac{2\,\big(1 + \gamma\min_{i\leq p}A_{ii}\big)\,\sigma_{\min}(S)}{1 + \gamma\max_{i\leq p}A_{ii}} + 2\,\gamma\,t\min_{i\leq p}A_{ii} - 2\varepsilon,
$$
and such a choice of $(\lambda,\gamma)$ guarantees that the minimum eigenvalue of the estimate is at least $\varepsilon > 0$; such a choice of $(\lambda,\gamma)$ belongs to $\hat{R}^{S}_{1,t,A,\varepsilon}$. In practice one might choose a larger value of $\lambda$ that yields a sparse and positive definite covariance matrix.
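For concreteness, the sketch below is one way to implement the closed-form update (8.5)-(8.6) with the choice $U = U_1$ described above. It is an illustrative sketch only: the function name jpen_cov, the default value of eps, and the uniform weights and tuning values in the usage lines are my own placeholder choices, and the sketch does not implement the data-driven selection of $(\lambda, \gamma)$ or the feasibility check on the minimum eigenvalue discussed above.

import numpy as np

def jpen_cov(S, lam, gam, t, a, eps=1e-3):
    """Sketch of the JPEN covariance update in (8.5)-(8.6).

    S   : p x p sample covariance matrix
    lam : l1 penalty parameter (lambda)
    gam : eigenvalue penalty parameter (gamma)
    t   : target value for the eigenvalues
    a   : length-p array of eigenvalue weights (diagonal of A)
    eps : small ridge used only to pick the eigenvector matrix U = U1
    """
    p = S.shape[0]
    _, U = np.linalg.eigh(S + eps * np.eye(p))   # S + eps*I = U1 D1 U1^T
    A = np.diag(a)
    UAUt = U @ A @ U.T
    C = np.eye(p) + gam * UAUt                   # C = I + gamma * U A U^T
    B = S + gam * t * UAUt                       # B = S + gamma * t * U A U^T
    BCinv = B @ np.linalg.inv(C)
    M = 0.5 * (BCinv + BCinv.T)                  # symmetrized BC^{-1}
    thr = lam / (2.0 * (1.0 + gam * np.max(a)))  # soft-threshold level
    Sigma = np.sign(M) * np.maximum(np.abs(M) - thr, 0.0)
    np.fill_diagonal(Sigma, np.diag(M))          # the diagonal is not thresholded
    return Sigma

# usage on a toy sample covariance matrix
rng = np.random.default_rng(1)
X = rng.standard_normal((100, 20))
S = np.cov(X, rowvar=False)
a = np.full(20, 1.0 / 20)                        # uniform weights, for illustration only
Sigma_hat = jpen_cov(S, lam=0.2, gam=1.0, t=np.trace(S) / 20, a=a)
print(np.linalg.eigvalsh(Sigma_hat).min())       # inspect the minimum eigenvalue

In practice one would verify that the returned estimate has minimum eigenvalue at least $\varepsilon$ and, if not, adjust $(\lambda,\gamma)$ as described above.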
8.2. Simulation Results.
8.2.1. Choice of weight matrix A. For $p > n$, $(p - n)$ of the sample eigenvalues are identically equal to zero, and many of the non-zero eigenvalues are approximately zero as well. The simulation analysis shows that if we shrink each eigenvalue towards a fixed constant (i.e., apply the same amount of shrinkage to each of the sample eigenvalues), the smaller eigenvalues are shrunk upward heavily, away from the true eigenvalues. Therefore we choose nonuniform weights for the eigenvalues to avoid over-shrinkage. Note that given a priori knowledge of the dispersion of the eigenvalues, one might be able to find better weights. Here we do not assume knowledge of any structure among the eigenvalues and choose the weights by the following scheme (we assume all the eigenvalues are ordered in decreasing order of magnitude):
i) Let $t$ be the average of the sample eigenvalues. Let $k$ be the index such that the $k$-th ordered eigenvalue is less than $t$. Let $r = p/n$ and $b_1 = \max(\mathrm{diag}(S))\,(1 + \sqrt{p/n})^2$.
ii) For $j = 1$ to $p$, set $c_j = b_j\,(1 + 0.005\log(1 + r))^{|j-k|}$ and $b_{j+1} = b_j^2/c_j$.
iii) Set $A = \mathrm{diag}(a_1, a_2, \dots, a_p)$, where $a_j = c_j/\sum_{j=1}^p c_j$,
where $|x|$ denotes the absolute value of $x$. Such a choice of weights allows more shrinkage of the extreme sample eigenvalues than of those in the center of the eigen-spectrum. The logarithmic term was chosen to scale the weights; this is an arbitrary choice that has worked well in our simulation settings. One possible implementation of this scheme is sketched below.
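The following sketch is one possible reading of steps i)-iii) above (the function name jpen_eigen_weights is mine; it takes $k$ to be the first index whose ordered eigenvalue falls below $t$ and reads $b_1$ as $\max(\mathrm{diag}(S))(1 + \sqrt{p/n})^2$); it is an illustration rather than the authoritative implementation.

import numpy as np

def jpen_eigen_weights(S, n):
    """One reading of the weight scheme i)-iii) for A = diag(a_1, ..., a_p).

    S : p x p sample covariance matrix
    n : sample size
    """
    p = S.shape[0]
    eig = np.sort(np.linalg.eigvalsh(S))[::-1]          # eigenvalues in decreasing order
    t = eig.mean()
    k = int(np.argmax(eig < t)) + 1                     # first index (1-based) with eigenvalue below t
    r = p / n
    b = np.max(np.diag(S)) * (1.0 + np.sqrt(p / n)) ** 2
    c = np.empty(p)
    for j in range(1, p + 1):
        c[j - 1] = b * (1.0 + 0.005 * np.log(1.0 + r)) ** abs(j - k)
        b = b ** 2 / c[j - 1]                           # b_{j+1} = b_j^2 / c_j
    return c / c.sum()                                  # a_j = c_j / sum of c_j
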
Figure 8.1 shows the heatmap of zero recovery (sparsity) for block and Toeplitz type covariance matrices based on 50 realizations for $n = 50$ and $p = 50$. The JPEN estimate of the covariance matrix recovers about 80% of the true zeros for the Toeplitz and block type covariance matrices. The proposed estimator also recovers the true structure of the non-zero entries and any pattern among the rows/columns of the covariance matrix.
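For completeness, the zero-recovery proportions underlying such a heatmap can be computed as in the sketch below (an illustration only; the argument estimates stands for the list of estimated covariance matrices from the 50 realizations, and the function name is mine).

import numpy as np

def zero_recovery_map(Sigma0, estimates, tol=1e-8):
    """Per-entry fraction of realizations in which a true zero of Sigma0
    is also (numerically) zero in the estimate."""
    zero_mask = (Sigma0 == 0)
    counts = np.zeros_like(Sigma0, dtype=float)
    for Sig_hat in estimates:
        counts += (np.abs(Sig_hat) <= tol)
    recovery = counts / len(estimates)
    recovery[~zero_mask] = np.nan        # recovery is only defined where Sigma0 is zero
    return recovery

Values near 1 correspond to the whitish cells in Figure 8.1.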

Fig 8.1. Heatmap of zeros identified in covariance matrix out of 50 realizations. Whitish
grid is 50/50 zeros identified, blackish grid is 0/50 zeros identified.

Table 8.1 gives the average relative errors and standard errors of the covariance matrix estimates based on glasso, Ledoit-Wolf, PDSCE and JPEN for $n = 100$ and $p = 500, 1000$. The glasso estimate of the covariance matrix performs very poorly among all the methods. The Ledoit-Wolf estimate performs well, but the estimate is generally not sparse; moreover, its eigenvalue estimates are shrunk towards the center more heavily than the true eigenvalues. The JPEN estimator outperforms the other estimators for most values of $p$ for all four types of covariance matrices. The PDSCE estimates have low average relative errors, close to those of JPEN; this could be because PDSCE and JPEN both use a quadratic loss function, with different penalty functions. Table 8.2 reports the average relative errors and their standard deviations for inverse covariance matrix estimation. Here we do not include the Ledoit-Wolf estimator and compare only the glasso and PDSCE estimates with the proposed JPEN estimator. The JPEN estimate of the inverse covariance matrix outperforms the other methods for all values of $p = 500$

and p = 1000 for all four types of structured inverse covariance matrices.
8.2.2. Covariance Matrix Estimation.

Table 8.1
Covariance matrix estimation: average relative error (standard error), n = 100

                 Hub type matrix                 Neighborhood type matrix
                 p=500           p=1000          p=500           p=1000
Ledoit-Wolf      1.07(0.165)     3.47(0.0477)    1.1(0.0331)     2.32(0.0262)
Glasso           9.07(0.167)     10.2(0.022)     9.61(0.0366)    10.4(0.0238)
PDSCE            1.48(0.0709)    2.03(0.0274)    0.844(0.0331)   1.8(0.0263)
JPEN             0.854(0.0808)   1.82(0.0273)    0.846(0.0332)   1.7(0.0263)

                 Block type matrix               Toeplitz type matrix
                 p=500           p=1000          p=500           p=1000
Ledoit-Wolf      4.271(0.0394)   2.18(0.11)      1.967(0.041)    2.344(0.028)
Glasso           9.442(0.0438)   30.4(0.0875)    12.741(0.051)   18.221(0.0398)
PDSCE            0.941(0.0418)   1.66(0.11)      0.873(0.0415)   1.82(0.028)
JPEN             0.887(0.0411)   1.66(0.11)      0.707(0.0416)   1.816(0.0282)

8.3. Inverse Covariance matrix Estimation.

Table 8.2
Inverse covariance matrix estimation: average relative error (standard error), n = 100

                 Hub type matrix                 Neighborhood type matrix
                 p=500           p=1000          p=500            p=1000
Glasso           9.82(0.0212)    10.9(0.0204)    12.365(0.0176)   13.084(0.0178)
PDSCE            1.13(0.0269)    2.07(0.0238)    1.74(0.0549)     3.79(0.0676)
JPEN             0.138(0.0153)   0.856(0.0251)   0.260(0.0234)    1.208(0.0277)

                 Block type matrix               Toeplitz type matrix
                 p=500           p=1000          p=500            p=1000
Glasso           12.4(0.0266)    13.1(0.0171)    19.3(0.0271)     20.7(0.0227)
PDSCE            0.993(0.0375)   1.83(0.0251)    1.89(0.0465)     3.79(0.0382)
JPEN             0.355(0.0319)   1.18(0.0258)    1.24(0.0437)     3.18(0.0432)

Ashwini Maurya
Department of Statistics
and Probability
Michigan State University
East Lansing, MI 48824-1027
U. S. A.
E-mail: mauryaas@stt.msu.edu
