$$
l_1 \xrightarrow{a.s.} (1 + \sqrt{c})^2, \qquad c = \lim_{n \to \infty} p/n,
$$
This suggests that $l_1$ is not a consistent estimator of the largest eigenvalue $\lambda_1$ of the population covariance matrix. In particular, if $n = p$ then $l_1$ tends to 4 whereas $\lambda_1$ is 1. This is also evident in the eigenvalue plot in figure 2.1. The distribution of $l_1$ also depends upon the underlying structure of the true covariance matrix. From figure 2.1, it is evident that the smaller sample eigenvalues tend to underestimate the true eigenvalues for large $p$ and small $n$. For more discussion, see Karoui (2008). To correct this bias,
a natural choice would be to shrink the sample eigenvalues towards some
suitable constant to reduce the over-dispersion. For instance, Stein (1975) proposed an estimator of the form $\hat{\Sigma} = U \Lambda(l) U^T$, where $\Lambda(l)$ is a diagonal matrix whose entries are a transformed function of the sample eigenvalues and $U$ is the matrix of sample eigenvectors. In another interesting paper, Ledoit and Wolf (2004) proposed an estimator that shrinks the sample covariance matrix towards the identity matrix. In another paper, Karoui (2008) proposed a non-parametric estimator of the spectrum of eigenvalues and showed that it is consistent in the sense of weak convergence of distributions.
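As a quick numerical illustration of the over-dispersion discussed above, the following minimal simulation sketch (not from the paper) draws data with an identity population covariance, so that $\lambda_1 = 1$, and inspects the extreme sample eigenvalues when $n = p$:

```python
import numpy as np

rng = np.random.default_rng(1)
n = p = 200                                # n = p, population covariance I_p
X = rng.standard_normal((n, p))            # rows are observations
S = X.T @ X / n                            # sample covariance matrix
eigs = np.linalg.eigvalsh(S)
print(eigs.max())                          # close to (1 + sqrt(p/n))^2 = 4
print(eigs.min())                          # close to 0: over-dispersed spectrum
```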
The covariance matrix estimates based on eigen-spectrum shrinkage are well conditioned in the sense that their eigenvalues are bounded well away from zero. These estimates are based on shrinkage of the eigenvalues and are therefore invariant under orthogonal transformations, i.e., the shrinkage estimators shrink the eigenvalues but the eigenvectors remain unchanged. In other words, the basis (eigenvectors) in which the data are given is not taken advantage of, and therefore these methods rely on the premise that one will be able to find a good estimate in any basis. In particular, it is reasonable to believe that the basis generating the data is somewhat nice. Often this translates into the assumption that the covariance matrix has a particular structure that one should be able to take advantage of. In these situations, it becomes natural to perform some form of regularization directly on the entries of the sample covariance matrix.
Much of the recent literature focuses on two broad classes of regularized covariance matrix estimators: i) one class relies on a natural ordering among the variables, where one often assumes that variables far apart are weakly correlated, and ii) the other class makes no assumption on a natural ordering among the variables. The first class includes estimators based on banding and tapering [Bickel and Levina (2008), Cai et al. (2010)]. These estimators are appropriate for a number of applications with ordered data (time series, spectroscopy, climate data). However, for many applications including gene expression data, a priori knowledge of any canonical ordering is not available and searching over all permutations of possible orderings is not feasible. In these situations, an $\ell_1$-penalized estimator, which yields a permutation-invariant estimate, is more appropriate.
To obtain a suitable estimate which is both well conditioned and sparse, we introduce two regularization terms: i) an $\ell_1$ penalty on each of the off-diagonal elements of the matrix, and ii) a squared-deviation penalty on the eigenvalues from a suitable constant. $\ell_1$-minimization problems are well studied in the covariance and inverse covariance matrix estimation literature [Friedman et al. (2007), Banerjee et al. (2008), Bickel and Levina (2008), Ravikumar et al. (2011), Jacob and Tibshirani (2011), Maurya (2014), etc.]. Meinshausen and Bühlmann (2006) studied the problem of variable selection using high-dimensional regression with the lasso and showed that it is a consistent selection scheme for high-dimensional graphs. Rothman et al. (2008) proposed an $\ell_1$-penalized log-likelihood estimator and showed that their estimator is consistent.
2.1. Proposed Estimator. Let S be the sample covariance matrix. Consider the following optimization problem.
(2.1)
$$
\hat{\Sigma}_{\lambda,\gamma} = \arg\min_{\Sigma = \Sigma^T} \Big[ \|\Sigma - S\|_2^2 + \lambda \|\Sigma^{-}\|_1 + \gamma \sum_{i=1}^{p} a_i \{\sigma_i(\Sigma) - t\}^2 \Big],
$$
where $\Sigma^{-}$ is the matrix $\Sigma$ with its diagonal elements set to zero, $\sigma_i(\Sigma)$ is the $i$-th eigenvalue of $\Sigma$, and $t$ is a suitable constant. We denote the minimizer of (2.1) by $\hat{\Sigma}_{\lambda,\gamma}$.
For $\gamma = 0$, (2.1) is the standard lasso estimator for the quadratic loss function, and its solution is (see Section 4 for a derivation of this estimator):
(2.2)
$$
\hat{\Sigma}_{ii} = s_{ii}, \qquad \hat{\Sigma}_{ij} = \operatorname{sign}(s_{ij}) \max\Big( |s_{ij}| - \frac{\lambda}{2},\ 0 \Big), \quad i \neq j,
$$
where $\operatorname{sign}(x)$ is the sign of $x$ and $|x|$ is the absolute value of $x$. It is clear from this expression that a sufficiently large value of $\lambda$ will result in a sparse covariance matrix estimate. But it is hard to assess whether the estimator (2.2) overcomes the over-dispersion of the sample eigenvalues. The eigenvalue plot in figure 2.1 illustrates this phenomenon for a neighborhood-type covariance matrix (see Section 5 for a description of this type of matrix). We simulated random vectors from a multivariate normal distribution with $n = 50$, $p = 50$.
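A minimal numpy sketch of the $\gamma = 0$ solution (2.2); the function name and the $\lambda/2$ threshold (obtained from (4.2) with $\gamma = 0$) are for illustration only:

```python
import numpy as np

def soft_threshold_covariance(S, lam):
    """Gamma = 0 solution (2.2): keep the diagonal of S and soft-threshold
    every off-diagonal entry of S by lam / 2."""
    Sigma_hat = np.sign(S) * np.maximum(np.abs(S) - lam / 2.0, 0.0)
    np.fill_diagonal(Sigma_hat, np.diag(S))   # diagonal entries are unpenalized
    return Sigma_hat
```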
For $\lambda = 0$, the minimizer of (2.1) is
(2.3)
$$
\hat{\Sigma} = \tfrac{1}{2}\big(\hat{\Sigma}_1 + \hat{\Sigma}_1^T\big), \qquad \hat{\Sigma}_1 = (S + \gamma t\, U A U^T)(I + \gamma\, U A U^T)^{-1},
$$
where $\hat{\Sigma}_1$ may not be symmetric but $\hat{\Sigma}$ is. To see whether the estimate above is positive definite, note that $\sigma_{\min}(\hat{\Sigma}_1) = \sigma_{\min}(\hat{\Sigma}_1^T)$; after some algebra, we have:
$$
\begin{aligned}
\sigma_{\min}(\hat{\Sigma}) &= \sigma_{\min}\big(S U (I + \gamma A)^{-1} U^T + \gamma t\, U A (I + \gamma A)^{-1} U^T\big) \\
&\geq \sigma_{\min}\big(S U (I + \gamma A)^{-1} U^T\big) + \gamma t\, \sigma_{\min}\big(U A (I + \gamma A)^{-1} U^T\big) \\
&\geq \frac{\sigma_{\min}(S)}{1 + \gamma \max_{i \leq p} A_{ii}} + \gamma t \min_{i \leq p} \Big( \frac{A_{ii}}{1 + \gamma A_{ii}} \Big) \\
&\geq \gamma t \min_{i \leq p} \frac{A_{ii}}{1 + \gamma A_{ii}} > 0
\end{aligned}
$$
for $\min_{i \leq p} A_{ii} > 0$. This means that the eigenvalue squared-deviation penalty improves $S$ to a positive definite estimator $\hat{\Sigma}$, provided that $\gamma > 0$, $t > 0$, and $\min_{i \leq p} A_{ii} > 0$. Note that the estimator (2.3) is well conditioned but need not be sparse. Sparsity can be achieved by imposing an $\ell_1$ penalty on each entry of the covariance matrix. Simulation experiments have shown that, in general, the minimizer of (2.1) is not positive definite for all values of $\lambda > 0$ and $\gamma > 0$. To achieve a well-conditioned, sparse, and positive definite estimator, we optimize the objective function of (2.1) over a specific region of values of $(\lambda, \gamma)$ which depends upon $S$, $t$, and $A$. The proposed JPEN estimator of the covariance matrix is given by:
(2.4)
$$
\hat{\Sigma} = \arg\min_{\Sigma = \Sigma^T,\ (\lambda, \gamma) \in \hat{R}_1^{S,t,A,\epsilon}} \Big[ \|\Sigma - S\|_F^2 + \lambda \|\Sigma^{-}\|_1 + \gamma \sum_{i=1}^{p} a_i \{\sigma_i(\Sigma) - t\}^2 \Big],
$$
where
$$
\hat{R}_1^{S,t,A,\epsilon} = \bigcup_{\epsilon > 0} \big\{ (\lambda, \gamma) : (\lambda, \gamma, \epsilon) \in \bar{R}_1^{S,t,A,\epsilon} \big\},
$$
and
$$
\bar{R}_1^{S,t,A,\epsilon} = \Big\{ (\lambda, \gamma, \epsilon) : \lambda, \gamma > 0,\ \lambda \asymp \gamma \asymp \sqrt{\tfrac{\log p}{n}},\ \frac{\sigma_{\min}(S)}{1 + \gamma \max_{i \le p} A_{ii}} + \gamma t \min_{i \le p}\Big(\frac{A_{ii}}{1 + \gamma A_{ii}}\Big) - \frac{\lambda}{2} \max_{i \le p}(1 + \gamma A_{ii})^{-1} \ge \epsilon \Big\}.
$$
The minimization in (2.4) over $\Sigma$ is for fixed $(\lambda, \gamma) \in \hat{R}_1^{S,t,A,\epsilon}$, where $\epsilon$ is some positive constant. Note that such a choice of $(\lambda, \gamma)$ guarantees that the minimum eigenvalue of the estimate in (2.4) is at least $\epsilon > 0$. Theorem 3.1 establishes that the set $\hat{R}_1^{S,t,A,\epsilon}$ is asymptotically nonempty.
2.2. Our Contribution. The main contributions are the following:
i) The proposed estimator is both sparse and well conditioned simultaneously. This approach allows one to take advantage of prior structure if known.
Theorem 3.1. Let $X := (X_1, X_2, \ldots, X_p)$ be a mean zero sub-gaussian random vector as defined in (3.1). Let $S = (1/n) X X^T$ be the sample covariance matrix, and let $p/n \to c < 1$ as $n = n(p) \to \infty$. Let $\hat{R}_1^{S,t,A,\epsilon}$ be as defined in (2.4). For $(\lambda, \gamma) \in \hat{R}_1^{S,t,A,\epsilon}$, we have $\hat{R}_1^{S,t,A,\epsilon} \to R_1^{\epsilon,\infty}$ in probability, where
$$
R_1^{\epsilon,\infty} = \bigcup_{\epsilon > 0} \big\{ (\lambda, \gamma) : g(c) > \epsilon \big\},
$$
and $g(c)$ denotes the limit in probability of $\sigma_{\min}(S)$.
Assumption A2 guarantees that the true covariance matrix $\Sigma_0$ is well conditioned (i.e., all the eigenvalues are finite and positive). A well-conditioned covariance matrix means [Ledoit and Wolf (2004)] that inverting the matrix does not explode the estimation error. Assumption A1 is more of a definition, which says that the number of non-zero off-diagonal elements is bounded by some positive integer. Theorem 3.2 below gives the rate of convergence of the proposed covariance matrix estimator (2.4) in the Frobenius norm.
Theorem 3.2. Let $(\lambda, \gamma) \in \hat{R}_1^{S,t,A,\epsilon}$ and $\hat{\Sigma}$ be as defined in (2.4). Under Assumptions A0, A1, A2, and for $\sigma_{\min}(\Sigma_0) \le t \le \sigma_{\max}(\Sigma_0)$, we have:
(3.2)
$$
\|\hat{\Sigma} - \Sigma_0\|_F = O_P\bigg( \sqrt{\frac{(p + s)\log p}{n}} \bigg).
$$
Here the worst part of the rate of convergence comes from estimating the diagonal entries. For correlation matrix estimation, the rate can be improved to $O_P\big(\sqrt{s \log p / n}\big)$ (Corollary 3.2).
Let $\Sigma_0 = W \Gamma_0 W$ be the variance-correlation decomposition of the true covariance matrix $\Sigma_0$, where $\Gamma_0$ is the true correlation matrix and $W$ is the diagonal matrix of true standard deviations. Let $\hat{K}$ be the solution to the following optimization problem:
(3.3)
$$
\hat{K} = \arg\min_{K = K^T,\ (\lambda, \gamma) \in \hat{R}_{1a}^{\Gamma,t,A,\epsilon}} \Big[ \|K - \hat{\Gamma}\|_F^2 + \lambda \|K^{-}\|_1 + \gamma \sum_{i=1}^{p} a_i \{\sigma_i(K) - t\}^2 \Big],
$$
where $\hat{R}_{1a}^{\Gamma,t,A,\epsilon}$ is given by:
(3.4)
$$
\hat{R}_{1a}^{\Gamma,t,A,\epsilon} = \bigcup_{\epsilon > 0} \big\{ (\lambda, \gamma) : (\lambda, \gamma, \epsilon) \in \bar{R}_{1a}^{\Gamma,t,A,\epsilon} \big\},
$$
and
$$
\bar{R}_{1a}^{\Gamma,t,A,\epsilon} = \Big\{ (\lambda, \gamma, \epsilon) : \lambda, \gamma > 0,\ \lambda \asymp \gamma \asymp \sqrt{\tfrac{\log p}{n}},\ \frac{\sigma_{\min}(\hat{\Gamma})}{1 + \gamma \max_{i \le p} A_{ii}} + \gamma t \min_{i \le p}\Big(\frac{A_{ii}}{1 + \gamma A_{ii}}\Big) - \frac{\lambda}{2} \max_{i \le p}(1 + \gamma A_{ii})^{-1} \ge \epsilon \Big\},
$$
and $\hat{\Gamma}$ is the sample counterpart of $\Gamma_0$. Similar to Theorem 3.1, the following corollary establishes that the symmetric difference between the set $\hat{R}_{1a}^{\Gamma,t,A,\epsilon}$ and its asymptotic counterpart $R_{1a}^{\epsilon,\infty}$ is empty as $n = n(p) \to \infty$.
Corollary 3.1. Let $\hat{R}_{1a}^{\Gamma,t,A,\epsilon}$ be as defined in (3.4). We have $\hat{R}_{1a}^{\Gamma,t,A,\epsilon} \to R_{1a}^{\epsilon,\infty}$ in probability, where
$$
R_{1a}^{\epsilon,\infty} = \bigcup_{\epsilon > 0} \big\{ (\lambda, \gamma) : (1 - \sqrt{c})^2 > \epsilon \big\}.
$$

Corollary 3.2. For $\sigma_{\min}(\Gamma_0) \le t \le \sigma_{\max}(\Gamma_0)$ and $(\lambda, \gamma) \in \hat{R}_{1a}^{\Gamma,t,A,\epsilon}$,
(3.5)
$$
\|\hat{K} - \Gamma_0\|_F = O_P\bigg( \sqrt{\frac{s \log p}{n}} \bigg).
$$
The improved rate is due to the fact that for a correlation matrix, all the diagonal entries are one. Define $\hat{\Sigma}_c := \hat{W} \hat{K} \hat{W}$, where $\hat{W}$ is the diagonal matrix of estimates of the true standard deviations based on the observations. The following theorem gives the rate of convergence of the correlation-matrix-based covariance matrix estimator in the operator norm.
Theorem 3.3. For $(\lambda, \gamma) \in \hat{R}_{1a}^{\Gamma,t,A,\epsilon}$,
(3.6)
$$
\|\hat{\Sigma}_c - \Sigma_0\| = O_P\bigg( \sqrt{\frac{(s + 1)\log p}{n}} \bigg).
$$
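A minimal sketch of the correlation-based construction $\hat{\Sigma}_c = \hat{W} \hat{K} \hat{W}$ described above; the helper `correlation_estimator` is a hypothetical placeholder standing in for the JPEN fit (3.3):

```python
import numpy as np

def covariance_from_correlation(S, correlation_estimator):
    """Form Sigma_c = W_hat K_hat W_hat: estimate the correlation matrix with a
    regularized fit and rescale by the sample standard deviations."""
    w_hat = np.sqrt(np.diag(S))                   # sample standard deviations
    Gamma_hat = S / np.outer(w_hat, w_hat)        # sample correlation matrix
    K_hat = correlation_estimator(Gamma_hat)      # e.g. the JPEN estimate (3.3)
    return np.outer(w_hat, w_hat) * K_hat         # W_hat K_hat W_hat
```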
non-negative covariance matrix, which is not the case for Rothman's (2012) estimator (since Rothman's estimator involves the logarithm of the determinant of the estimator as another penalty to keep all the eigenvalues of the estimated matrix away from zero).
3.2. Estimation of Inverse Covariance Matrix. Notation: We shall use $\Omega$ for the inverse covariance matrix.
Assumptions: We make the following assumptions about the true inverse covariance matrix $\Omega_0$. Let $\Omega_0 = \Sigma_0^{-1}$.
B0. The random vector $X := (X_1, X_2, \ldots, X_p)$ is a mean zero vector with covariance matrix $\Sigma_0$ such that each $X_i/\sqrt{\Sigma_{0ii}}$ has a sub-gaussian distribution with parameter $\tau$ as in (3.1).
B1. With $H = \{(i, j) : \Omega_{0ij} \neq 0,\ i \neq j\}$, the cardinality of $H$ is at most $s$ for some positive integer $s$.
B2. There exists a sufficiently large $0 < \bar{k} < \infty$ such that $(1/\bar{k}) \le \sigma_{\min}(\Omega_0) \le \sigma_{\max}(\Omega_0) \le \bar{k}$. This condition is required in establishing the rate of convergence of the estimator (3.7) (see Theorem 3.5).
Define the JPEN estimator of the inverse covariance matrix $\Omega_0$ as the solution to the following optimization problem:
(3.7)
$$
\hat{\Omega} = \arg\min_{\Omega = \Omega^T,\ (\lambda, \gamma) \in \hat{R}_2^{S,t,A,\epsilon}} \Big[ \|\Omega - S^{-1}\|_F^2 + \lambda \|\Omega^{-}\|_1 + \gamma \sum_{i=1}^{p} a_i \{\sigma_i(\Omega) - t\}^2 \Big],
$$
where
(3.8)
$$
\hat{R}_2^{S,t,A,\epsilon} = \bigcup_{\epsilon > 0} \big\{ (\lambda, \gamma) : (\lambda, \gamma, \epsilon) \in \bar{R}_2^{S,t,A,\epsilon} \big\},
$$
with
$$
\bar{R}_2^{S,t,A,\epsilon} = \Big\{ (\lambda, \gamma, \epsilon) : \lambda, \gamma > 0,\ \lambda \asymp \gamma \asymp \sqrt{\tfrac{\log p}{n}},\ \frac{\sigma_{\min}(S^{-1})}{1 + \gamma \max_{i \le p} A_{ii}} + \gamma t \min_{i \le p}\Big(\frac{A_{ii}}{1 + \gamma A_{ii}}\Big) - \frac{\lambda}{2} \max_{i \le p}(1 + \gamma A_{ii})^{-1} \ge \epsilon \Big\}.
$$
(3.9)
$$
\|\hat{\Omega} - \Omega_0\|_F = O_P\bigg( \sqrt{\frac{(p + s)\log p}{n}} \bigg).
$$
Note that the rate of convergence here is the same as for covariance matrix estimation. Let $\hat{L}$ be the solution to the following optimization problem:
(3.10)
$$
\hat{L} = \arg\min_{L = L^T,\ (\lambda, \gamma) \in \hat{R}_{2a}^{\Gamma,t,A,\epsilon}} \Big[ \|L - \hat{\Gamma}^{-1}\|_F^2 + \lambda \|L^{-}\|_1 + \gamma \sum_{i=1}^{p} a_i \{\sigma_i(L) - t\}^2 \Big],
$$
where $\hat{\Gamma}^{-1} = \hat{W} S^{-1} \hat{W}$ and
(3.11)
$$
\hat{R}_{2a}^{\Gamma,t,A,\epsilon} = \bigcup_{\epsilon > 0} \big\{ (\lambda, \gamma) : (\lambda, \gamma, \epsilon) \in \bar{R}_{2a}^{\Gamma,t,A,\epsilon} \big\},
$$
with
$$
\bar{R}_{2a}^{\Gamma,t,A,\epsilon} = \Big\{ (\lambda, \gamma, \epsilon) : \lambda, \gamma > 0,\ \lambda \asymp \gamma \asymp \sqrt{\tfrac{\log p}{n}},\ \frac{\sigma_{\min}(\hat{\Gamma}^{-1})}{1 + \gamma \max_{i \le p} A_{ii}} + \gamma t \min_{i \le p}\Big(\frac{A_{ii}}{1 + \gamma A_{ii}}\Big) - \frac{\lambda}{2} \min_{i \le p}(1 + \gamma A_{ii})^{-1} \ge \epsilon \Big\}.
$$
Corollary 3.3. Let $p/n \to c < 1$ as $n = n(p) \to \infty$ and $\hat{R}_{2a}^{\Gamma,t,A,\epsilon}$ be as defined in (3.11). For $(\lambda, \gamma) \in \hat{R}_{2a}^{\Gamma,t,A,\epsilon}$, we have $\hat{R}_{2a}^{\Gamma,t,A,\epsilon} \to R_{2a}^{\epsilon,\infty}$ in probability, where
$$
R_{2a}^{\epsilon,\infty} = \bigcup_{\epsilon > 0} \big\{ (\lambda, \gamma) : g_2(c) > \epsilon \big\},
$$
and $g_2(c)$ denotes the limit in probability of $\sigma_{\min}(\hat{\Gamma}^{-1})$.
Under Assumptions B0, B1, B2 and for $(\lambda, \gamma) \in \hat{R}_{2a}^{\Gamma,t,A,\epsilon}$,
(3.12)
$$
\|\hat{L} - \Gamma_0^{-1}\|_F = O_P\bigg( \sqrt{\frac{s \log p}{n}} \bigg).
$$
Corollary 3.4. For $(\lambda, \gamma) \in \hat{R}_{2a}^{\Gamma,t,A,\epsilon}$,
(3.13)
$$
\|\hat{\Omega}_c - \Omega_0\| = O_P\bigg( \sqrt{\frac{(s + 1)\log p}{n}} \bigg).
$$
The rate of convergence of the inverse covariance matrix estimator based on the inverse correlation matrix is thus the same as that of the covariance matrix estimator based on the correlation matrix.
4. An Algorithm.
4.1. Covariance Matrix Estimation. The optimization problem (2.4) can be written as:
(4.1)
$$
\hat{\Sigma} = \arg\min_{\Sigma = \Sigma^T,\ (\lambda, \gamma) \in \hat{R}_1^{S,t,A,\epsilon}} f(\Sigma), \qquad f(\Sigma) = \|\Sigma - S\|_F^2 + \lambda \|\Sigma^{-}\|_1 + \gamma \sum_{i=1}^{p} a_i \{\sigma_i(\Sigma) - t\}^2.
$$
Its solution is given by:
(4.2)
$$
\hat{\Sigma}_{ij} = \operatorname{sign}(M_{ij}) \max\Big( |M_{ij}| - \frac{\lambda}{2(1 + \gamma \max_{i \le p} A_{ii})},\ 0 \Big), \quad i \neq j,
$$
where
$$
M = \tfrac{1}{2}\big( M_1 + M_1^T \big) \quad \text{with} \quad M_1 = (S + \gamma t\, U A U^T)(I + \gamma\, U A U^T)^{-1},
$$
$A = \operatorname{diag}(A_{11}, A_{22}, \ldots, A_{pp})$ with $A_{ii} = a_i$, and $(\lambda, \gamma) \in \hat{R}_1^{S,t,A,\epsilon}$.
Choice of U:
Note that $U$ is the matrix of eigenvectors of $\Sigma$, which is unknown. One choice of $U$ is the matrix of eigenvectors from the eigenvalue decomposition of $S + \delta I$ for some $\delta > 0$; i.e., let $S + \delta I = U_1 D_1 U_1^T$, then take $U = U_1$.
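The following is a minimal numpy sketch of the closed-form update (4.2) with the choice of $U$ just described; the function name, the default $\delta$, and the input conventions are assumptions made for illustration:

```python
import numpy as np

def jpen_covariance(S, lam, gam, t, a, delta=1e-4):
    """One-step JPEN covariance estimate following (4.2).

    S: sample covariance; lam, gam: l1 and eigenvalue penalties; t: shrinkage
    target; a: eigenvalue weights (diagonal of A); U is taken from S + delta*I."""
    p = S.shape[0]
    A = np.diag(a)
    _, U = np.linalg.eigh(S + delta * np.eye(p))      # choice of U
    UAUt = U @ A @ U.T
    M1 = (S + gam * t * UAUt) @ np.linalg.inv(np.eye(p) + gam * UAUt)
    M = 0.5 * (M1 + M1.T)                             # symmetrize
    thr = lam / (2.0 * (1.0 + gam * np.max(a)))       # soft-threshold level
    Sigma_hat = np.sign(M) * np.maximum(np.abs(M) - thr, 0.0)
    np.fill_diagonal(Sigma_hat, np.diag(M))           # diagonal is not penalized
    return Sigma_hat
```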
Choice of $\lambda$ and $\gamma$:
For a given value of $\gamma$, we can find a value of $\lambda$ satisfying:
$$
\lambda < 2\big(1 + \gamma \min_{i \le p} A_{ii}\big) \Big\{ \frac{\sigma_{\min}(S)}{1 + \gamma \max_{i \le p} A_{ii}} + 2 \gamma t \min_{i \le p} A_{ii} - 2\epsilon \Big\}.
$$
For such a choice of $(\lambda, \gamma)$, the estimate is
$$
\hat{\Sigma}_{ii} = M_{ii}, \qquad \hat{\Sigma}_{ij} = \operatorname{sign}(M_{ij}) \max\Big( |M_{ij}| - \frac{\lambda}{2(1 + \gamma \max_{i \le p} A_{ii})},\ 0 \Big), \quad i \neq j.
$$
Fig 4.1. Timing comparison of JPEN, Graphical Lasso (Glasso), and PDSCE on a log-log scale (time in seconds versus the number of covariates p).
5. Simulation Results.
We compare the performance of the proposed method to other existing
methods on simulated data for four types of structured covariance and inverse covariance matrices.
(i) Hub Graph: The rows/columns of $\Sigma_0$ are partitioned into $J$ equally sized disjoint groups $\{V_1 \cup V_2 \cup \cdots \cup V_J\} = \{1, 2, \ldots, p\}$, each group associated with a pivotal row $k$. Let the group size be $|V_1| = s$. We set $\Sigma_{0i,j} = \Sigma_{0j,i} = \rho$ for $i \in V_k$ and $\Sigma_{0i,j} = \Sigma_{0j,i} = 0$ otherwise. In our experiment, $J = [p/s]$, $k = 1, s + 1, 2s + 1, \ldots$, and we always take $\rho = 1/(s + 1)$ with $J = 20$ (a generator sketch is given below).
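A minimal sketch of the hub-type $\Sigma_0$ just described, assuming a unit diagonal; the 0-based indexing is an illustration choice:

```python
import numpy as np

def hub_covariance(p, s):
    """Hub-type covariance: pivotal rows k = 1, s+1, 2s+1, ... (0-based below),
    off-diagonal value rho = 1/(s+1) between each pivot and its group members."""
    rho = 1.0 / (s + 1)
    Sigma = np.eye(p)
    for k in range(0, p, s):                 # pivotal row of each group
        for i in range(k, min(k + s, p)):    # members of the group V_k
            if i != k:
                Sigma[i, k] = Sigma[k, i] = rho
    return Sigma

Sigma0 = hub_covariance(p=100, s=5)          # J = p/s = 20 groups
```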
[Table: average relative errors (with standard errors) of the covariance matrix estimates from Ledoit-Wolf, Glasso, PDSCE, and JPEN, reported for block type, Toeplitz type, and the other structured covariance matrices.]
[Table: average relative errors (with standard errors) of the inverse covariance matrix estimates from Glasso, PDSCE, and JPEN, reported for block type, Toeplitz type, and the other structured inverse covariance matrices.]
Fig 5.2. Eigenvalue plot for n = 20, p = 50 based on 50 realizations.
6. Colon Tumor Classification Example. In this section, we compare the performance of our proposed covariance matrix estimator for Linear Discriminant Analysis (LDA) classification of tumors using gene expression data from Alon et al. (1999). In this experiment, colon adenocarcinoma tissue samples were collected, 40 of which were tumor tissues and 22 non-tumor tissues. Tissue samples were analyzed using an Affymetrix oligonucleotide array. The data were processed, filtered, and reduced to a subset of 2,000 gene expression values with the largest minimal intensity over the 62 tissue samples (source: http://genomics-pubs.princeton.edu/oncology/affydata/index.html). Additional information about the dataset and its pre-processing can be found in Alon et al. (1999). In our analysis, we reduce the number of genes by selecting the $p$ most significant genes based on logistic regression. We obtain estimates of the inverse covariance matrix for $p = 50, 100, 200$ and then use LDA to classify these tissues as either tumorous or non-tumorous (normal). We classify each test observation $x$ to either class $k = 0$ or $k = 1$ using the LDA rule
(6.1)
$$
\hat{\delta}_k(x) = \arg\max_{k} \Big\{ x^T \hat{\Omega} \hat{\mu}_k - \frac{1}{2} \hat{\mu}_k^T \hat{\Omega} \hat{\mu}_k + \log(\hat{\pi}_k) \Big\},
$$
where $\hat{\pi}_k$ is the proportion of class $k$ observations in the training data, $\hat{\mu}_k$ is the sample mean for class $k$ on the training data, and $\hat{\Omega} := \hat{\Sigma}^{-1}$ is an estimator of the inverse of the common covariance matrix on the training data, computed by one of the methods under consideration. Tuning parameters $\lambda$ and $\gamma$ were chosen using 5-fold cross-validation. To create training and test sets, we randomly split the data into a training set of size 42 and a test set of size 20; following the approach used by Wang et al. (2007), we require the training set to have 27 tumor samples and 15 non-tumor samples. We repeat the split at random 100 times and measure the average classification error.
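A minimal sketch of the LDA rule (6.1); the estimated precision matrix, class means, and class proportions are assumed to be supplied by the training step:

```python
import numpy as np

def lda_classify(x, Omega_hat, means, priors):
    """Assign x to the class k maximizing
    x' Omega mu_k - 0.5 * mu_k' Omega mu_k + log(pi_k)."""
    scores = [x @ Omega_hat @ mu - 0.5 * mu @ Omega_hat @ mu + np.log(pi)
              for mu, pi in zip(means, priors)]
    return int(np.argmax(scores))
```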
Table 6.1
Averages and standard errors of classification errors over 100 replications, in %.

Method                 p=50          p=100         p=200
Logistic Regression    21.0 (0.84)   19.31 (0.89)  21.5 (0.85)
SVM                    16.70 (0.85)  16.76 (0.97)  18.18 (0.96)
Naive Bayes            13.3 (0.75)   14.33 (0.85)  14.63 (0.75)
Graphical Lasso        10.9 (1.3)    9.4 (0.89)    9.8 (0.90)
Joint Penalty          9.9 (0.98)    8.9 (0.93)    8.2 (0.81)
Since we do not have a separate validation set, we perform 5-fold cross-validation on the training data. At each split, we divide the training data into 5 subsets (folds), where 4 subsets are used to estimate the covariance matrix and 1 subset is used to measure the classifier's performance. For each split, this procedure is repeated 5 times, taking each of the 5 subsets in turn as validation data. An optimal combination of $\lambda$ and $\gamma$ is obtained by minimizing the average classification error; a sketch of this search is given below. The tuning parameter for the graphical lasso was obtained by a similar criterion.
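A minimal sketch of the cross-validated grid search described above; `fit_precision` and `classify` are hypothetical placeholders for the covariance/precision estimator and the LDA rule under study:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def choose_tuning(X, y, lambdas, gammas, fit_precision, classify):
    """Pick (lambda, gamma) minimizing the average 5-fold classification error."""
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    best, best_err = None, np.inf
    for lam in lambdas:
        for gam in gammas:
            errs = []
            for tr, va in cv.split(X, y):
                Omega_hat = fit_precision(X[tr], lam, gam)
                preds = np.asarray(classify(Omega_hat, X[tr], y[tr], X[va]))
                errs.append(np.mean(preds != y[va]))
            if np.mean(errs) < best_err:
                best, best_err = (lam, gam), np.mean(errs)
    return best
```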
The average classification errors with standard errors over the 100 splits are presented in Table 6.1. Since the sample size is less than the number of genes, we omit the inverse sample covariance matrix, as it is not well defined, and instead include the naive Bayes and support vector machine classifiers. Naive Bayes has been shown to perform better than the sample covariance matrix in high-dimensional settings (Bickel and Levina (2004)). The support vector machine (SVM) is another popular tool for high-dimensional classification (Chih-Wei Hsu et al. (2010)). Among all the methods, the covariance-matrix-based LDA classifiers perform far better than naive Bayes, SVM, and logistic regression. For all other classifiers, the classification performance deteriorates with increasing $p$. For larger $p$, i.e., when more genes are added to the data set, the classification performance of the JPEN-estimate-based LDA classifier improves; this differs from the analysis of the same data set by Rothman et al. (2008), where the authors pointed out that as more genes are added to the data set, the classifier's performance deteriorates. Note that the classification error of a covariance-matrix-based classifier initially decreases with increasing $p$ and deteriorates for large $p$. This is due to the fact that as the dimension of the covariance matrix increases, the estimator does not remain very informative. In particular, for $p = 2000$, when all the genes are used in the data analysis, the classification error of JPEN and glasso is about 30%, which is much higher than for $p = 50$.
7. Summary. We have proposed and analyzed regularized estimation of large covariance and inverse covariance matrices using a joint penalty. One of its biggest advantages is that the optimization carries little computational burden, unlike many other methods for covariance regularization, and the resulting algorithm is very fast, efficient, and easily scalable to large-scale data analysis problems. We show that our estimators of the covariance and inverse covariance matrix are consistent in the Frobenius and operator norms. The operator norm consistency guarantees consistency for principal components, hence we expect that PCA will be one of the most important applications of the method. Although the estimators in (2.4) and (3.7) do not require any assumption on the structure of the true covariance and inverse covariance matrices respectively, a priori knowledge of any structure of the true covariance matrix may help in choosing a suitable weight matrix and hence improve estimation.
Acknowledgments
I would like to express my deep gratitude to Professor Hira L. Koul for his
valuable and constructive suggestions during the planning and development
of this research work.
References.
[1] Alon U., Barkai N., Notterman D., Gish K., Ybarra S., Mack D. and Levine A., Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proceedings of the National Academy of Sciences USA, 96(12):6745-6750, 1999.
[2] Banerjee O., El Ghaoui L. and d'Aspremont A., Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. Journal of Machine Learning Research, 9, 485-516, 2008.
[3] Bickel P. and Levina E., Regularized estimation of large covariance matrices. The Annals of Statistics, 36, 199-227, 2008.
[4] Bickel P. and Levina E., Covariance regularization by thresholding. The Annals of Statistics, 36, 2577-2604, 2008.
[5] Cai T., Zhang C. and Zhou H., Optimal rates of convergence for covariance matrix estimation. The Annals of Statistics, 38, 2118-2144, 2010.
[6] Cai T., Liu W. and Luo X., A constrained $\ell_1$ minimization approach to sparse precision matrix estimation. Journal of the American Statistical Association, 106, 594-607, 2011.
[7] Chaudhuri S., Drton M. and Richardson T., Estimation of a covariance matrix with zeros. Biometrika, 94(1), 199-216, 2007.
[8] Clarke R., Ressom H., Wang A., Xuan J., Liu M., Gehan E. and Wang Y., The properties of high-dimensional data spaces: implications for exploring gene and protein expression data. Nature Reviews Cancer, 8(1), 37-49, 2008.
[9] Dempster A., Covariance selection. Biometrika, 32, 95-108, 1972.
[33] Wang L., Zhu J. and Zou H., Hybrid huberized support vector machines for microarray classification. In ICML '07: Proceedings of the 24th International Conference on Machine Learning, pages 983-990, New York, NY, USA, ACM Press, 2007.
[34] Xue L., Ma S. and Zou H., Positive-definite l1-penalized estimation of large covariance matrices. Journal of the American Statistical Association, Theory and Methods, 107(500), 2012.
[35] Yin Y. and Bai Z., Limit of the smallest eigenvalue of a large dimensional sample covariance matrix. The Annals of Probability, 21(3), 1275-1294, 1993.
[36] Yuan M. and Lin Y., Model selection and estimation in the Gaussian graphical model. Biometrika, 94(1), 19-35, 2007.
[37] Yuan M., Sparse inverse covariance matrix estimation via linear programming. Journal of Machine Learning Research, 11, 2261-2286, 2009.
8. Technical Proofs.
Proof of Theorem 3.1. Let $\Sigma = U D U^T$ be the eigenvalue decomposition of $\Sigma$. Let
$$
f_1(D) = \|U D U^T - S\|_F^2 + \lambda \|(U D U^T)^{-}\|_1 + \gamma \sum_{1 \le i \le p} a_i \{ \sigma_i(\Sigma) - t \}^2
= \operatorname{tr}(D^2) - 2\, \operatorname{tr}(S\, U D U^T) + \operatorname{tr}(S^2) + \lambda \|(U D U^T)^{-}\|_1 + \gamma\, \operatorname{tr}(A D^2) - 2 \gamma t\, \operatorname{tr}(A D) + \gamma t^2\, \operatorname{tr}(A).
$$
The solution of $\partial f_1 / \partial D = 0$ satisfies
$$
\hat{D} = (U^T S U + \gamma t\, A)(I + \gamma A)^{-1} - \frac{\lambda}{2}\, U^T \operatorname{sign}(U \hat{D} U^T)\, U\, (I + \gamma A)^{-1}.
$$
Positive definiteness of the eigenvalue matrix $\hat{D}$ implies positive definiteness of $\hat{\Sigma}$. Next we derive a lower bound on the smallest eigenvalue of $\hat{D}$. Note that
$$
\sigma_{\max}\big\{ U^T \operatorname{sign}(U \hat{D} U^T)\, U\, (I + \gamma A)^{-1} \big\} = \sigma_{\max}\big( (I + \gamma A)^{-1} \big) = \frac{1}{1 + \gamma \min_{i \le p} A_{ii}}.
$$
Hence we obtain
$$
\begin{aligned}
\sigma_{\min}(\hat{D}) &\ge \sigma_{\min}\big\{ U^T S U (I + \gamma A)^{-1} \big\} + \gamma t\, \sigma_{\min}\big\{ A (I + \gamma A)^{-1} \big\} - \frac{\lambda}{2}\, \frac{1}{1 + \gamma \min_{i \le p} A_{ii}} \\
&\ge \frac{\sigma_{\min}(S)}{1 + \gamma \max_{i \le p} A_{ii}} + \gamma t \min_{i \le p} \frac{A_{ii}}{1 + \gamma A_{ii}} - \frac{\lambda}{2}\, \frac{1}{1 + \gamma \min_{i \le p} A_{ii}}.
\end{aligned}
$$
For $\lambda \asymp \sqrt{\log p / n}$ and $\gamma \asymp \sqrt{\log p / n}$, we have $\sigma_{\min}(S) \to g(c) > 0$ in probability by a theorem in [34]. Next we shall prove that $\hat{R}_1^{S,t,A,\epsilon} \to R_1^{\epsilon,\infty}$ in probability. Define
$$
Y_{\lambda, \gamma, t, A} = \frac{\sigma_{\min}(S)}{1 + \gamma \max_{i \le p} A_{ii}} + \gamma t \min_{i \le p} \frac{A_{ii}}{1 + \gamma A_{ii}} - \frac{\lambda}{2}\, \frac{1}{1 + \gamma \min_{i \le p} A_{ii}}.
$$
Since $\sigma_{\min}(S) \to g(c)$, for a given $\delta > 0$ there exists a positive integer $N_1$ such that for all $n = n(p) \ge N_1$,
$$
P\big( |Y_{\lambda, \gamma, t, A} - g(c)| < \delta \big) \ge 1 - \delta,
$$
i.e., $g(c) - \delta \le Y_{\lambda, \gamma, t, A} \le g(c) + \delta$ with high probability. Taking $\delta \to 0$, we have $\hat{R}_1^{S,t,A,\epsilon} \,\Delta\, R_1^{\epsilon,\infty} = \emptyset$. Hence the theorem.
Remark: Note that the above result holds in an asymptotic sense under the assumptions of Theorem 3.1. For finite samples, when $n < p$ we have $\sigma_{\min}(S) = 0$, and because $\min_{i \le p} A_{ii} > 0$,
$$
\sigma_{\min}(\hat{D}) \ge \gamma t \min_{i \le p} \frac{A_{ii}}{1 + \gamma A_{ii}} - \frac{\lambda}{2}\, \frac{1}{1 + \gamma \min_{i \le p} A_{ii}} = \frac{1}{1 + \gamma \min_{i \le p} A_{ii}} \Big\{ \gamma t \min_{i \le p} A_{ii} - \frac{\lambda}{2} \Big\} > 0,
$$
provided $\lambda < 2 \gamma t \min_{i \le p} A_{ii}$.
$+\ \gamma \sum_{i=1}^{p} a_i \{ \sigma_i(\Sigma) - t \}^2$, where $\Sigma^{-}$ is the matrix $\Sigma$ with all the diagonal elements set to zero. Define the function $Q(\cdot)$ as follows:
$$
Q(\Sigma) = f(\Sigma) - f(\Sigma_0),
$$
where $\Sigma_0$ is the true covariance matrix and $\Sigma$ is any other covariance matrix. Let $\Sigma = U D U^T$ be the eigenvalue decomposition of $\Sigma$, where $D$ is the diagonal matrix of eigenvalues and $U$ is the matrix of eigenvectors. Write $\Delta = \Sigma - \Sigma_0$. We have,
(8.1)
$$
\begin{aligned}
|\operatorname{tr}(\Delta(\Sigma_0 - S))| &\le \sum_{i \neq j} |\Delta_{ij} (\Sigma_{0ij} - S_{ij})| + \sum_{i=1}^{p} |\Delta_{ii} (\Sigma_{0ii} - S_{ii})| \\
&\le \max_{i \neq j}\big(|\Sigma_{0ij} - S_{ij}|\big)\, \|\Delta^{-}\|_1 + \sqrt{p}\, \max_{i}\big(|\Sigma_{0ii} - S_{ii}|\big) \sqrt{\sum_{i=1}^{p} \Delta_{ii}^2} \\
&\le C_0 (1 + \tau) \max_{i}(\Sigma_{0ii}) \Big\{ \sqrt{\tfrac{\log p}{n}}\, \|\Delta^{-}\|_1 + \sqrt{\tfrac{p \log p}{n}}\, \|\Delta^{+}\|_F \Big\} \\
&\le C_1 \Big\{ \sqrt{\tfrac{\log p}{n}}\, \|\Delta^{-}\|_1 + \sqrt{\tfrac{p \log p}{n}}\, \|\Delta^{+}\|_F \Big\}
\end{aligned}
$$
holds with high probability, where $\Delta^{-}$ and $\Delta^{+}$ denote the off-diagonal and diagonal parts of $\Delta$ and $C_1 = C_0 (1 + \tau) \max_i(\Sigma_{0ii})$, $C_0 > 0$; this follows from a result (Lemma 1) of Ravikumar et al. (2011) on the tail inequality for the sample covariance matrix of sub-gaussian random vectors.
Next, we obtain an upper bound on the terms involving $\gamma$. We have
$$
\gamma\big[\operatorname{tr}(A D^2 - 2t\, A D) - \operatorname{tr}(A D_0^2 - 2t\, A D_0)\big] = \gamma\, \operatorname{tr}\{ A (U^T \Sigma^2 U - U_0^T \Sigma_0^2 U_0) \} - 2 \gamma t\, \operatorname{tr}\{ A (U^T \Sigma U - U_0^T \Sigma_0 U_0) \} \le \gamma\big( 2 \bar{k} \sqrt{p}\, \|\Delta^{+}\|_F + \|\Delta\|_F^2 + 2 \sqrt{p}\, \|\Delta^{+}\|_F \big).
$$
In addition,
$$
\lambda\big( \|\Sigma^{-}\|_1 - \|\Sigma_0^{-}\|_1 \big) = \lambda\big( \|\Delta_{E}^{-} + \Sigma_0^{-}\|_1 + \|\Delta_{E^c}^{-}\|_1 - \|\Sigma_0^{-}\|_1 \big) \ge \lambda\big( \|\Delta_{E^c}^{-}\|_1 - \|\Delta_{E}^{-}\|_1 \big),
$$
where $E = \{(i, j) : \Sigma_{0ij} \neq 0,\ i \neq j\}$.
Let $\lambda = (C_1/\epsilon)\sqrt{\log p / n}$ and $\gamma = (C_1/\epsilon_1)\sqrt{\log p / n}$, where $(\lambda, \gamma) \in \hat{R}_1^{S,t,A,\epsilon}$ and $(1/\bar{k}) \le t \le \bar{k}$. We obtain
$$
\begin{aligned}
G(\Delta) &\ge \operatorname{tr}(\Delta^2) - 2 C_1 \Big\{ \sqrt{\tfrac{\log p}{n}}\, \|\Delta^{-}\|_1 + \sqrt{\tfrac{p \log p}{n}}\, \|\Delta^{+}\|_F \Big\} - \frac{C_1}{\epsilon_1} \sqrt{\tfrac{\log p}{n}} \Big\{ 2\bar{k}\sqrt{p}\, \|\Delta^{+}\|_F + \|\Delta\|_F^2 + 2\sqrt{p}\, \|\Delta^{+}\|_F \Big\} + \frac{C_1}{\epsilon} \sqrt{\tfrac{\log p}{n}} \big( \|\Delta_{E^c}^{-}\|_1 - \|\Delta_{E}^{-}\|_1 \big) \\
&\ge \|\Delta\|_F^2 \Big( 1 - \frac{C_1}{\epsilon_1} \sqrt{\tfrac{\log p}{n}} \Big) - 2 C_1 \sqrt{\tfrac{p \log p}{n}}\, \|\Delta^{+}\|_F - \frac{2 C_1 (1 + \bar{k})}{\epsilon_1} \sqrt{\tfrac{p \log p}{n}}\, \|\Delta^{+}\|_F - 2 C_1 \sqrt{\tfrac{\log p}{n}}\, \|\Delta_{E}^{-}\|_1.
\end{aligned}
$$
Also, because $\|\Delta_{E}^{-}\|_1 = \sum_{(i,j) \in E,\, i \neq j} |\Delta_{ij}| \le \sqrt{s}\, \|\Delta\|_F$,
$$
2 C_1 \sqrt{\tfrac{\log p}{n}}\, \|\Delta_{E}^{-}\|_1 \le 2 C_1 \sqrt{\tfrac{s \log p}{n}}\, \|\Delta\|_F.
$$
Therefore,
$$
\begin{aligned}
G(\Delta) &\ge \|\Delta\|_F^2 \Big( 1 - \frac{C_1}{\epsilon_1} \sqrt{\tfrac{\log p}{n}} \Big) - 2 C_1 \Big( 1 + \frac{1 + \bar{k}}{\epsilon_1} \Big) \sqrt{\tfrac{p \log p}{n}}\, \|\Delta^{+}\|_F - 2 C_1 \sqrt{\tfrac{s \log p}{n}}\, \|\Delta\|_F \\
&\ge \|\Delta\|_F^2 \Big[ 1 - \frac{C_1}{\epsilon_1} \sqrt{\tfrac{\log p}{n}} - 2 C_1 \Big( 1 + \frac{1 + \bar{k}}{\epsilon_1} \Big) \frac{\sqrt{(p + s) \log p / n}}{\|\Delta\|_F} \Big] \\
&\ge \|\Delta\|_F^2 \Big[ 1 - \frac{C_1}{\epsilon_1} \sqrt{\tfrac{\log p}{n}} - \frac{2 C_1 + 2 C_1 (1 + \bar{k}) / \epsilon_1}{M} \Big] \ge 0
\end{aligned}
$$
for all sufficiently large $n$ and $M$. Hence the result.
Proof of Theorem 3.3. We have
$$
\|\hat{\Sigma}_c - \Sigma_0\| = \|\hat{W} \hat{K} \hat{W} - W \Gamma_0 W\| \le \|\hat{W}\|^2\, \|\hat{K} - \Gamma_0\| + \|\hat{W} - W\|\, \|\Gamma_0\|\, \big( \|\hat{W}\| + \|W\| \big).
$$
Since $\|\Gamma_0\| = O(1)$, it follows from Corollary 3.2 that $\|\hat{K}\| = O(1)$. Also,
$$
\|\hat{W}^2 - W^2\| = \max_{\|x\|_2 = 1} \Big| \sum_{i=1}^{p} (\hat{w}_i^2 - w_i^2)\, x_i^2 \Big| \le \max_{1 \le i \le p} |\hat{w}_i^2 - w_i^2| \sum_{i=1}^{p} x_i^2 = \max_{1 \le i \le p} |\hat{w}_i^2 - w_i^2| = O_P\Big( \sqrt{\tfrac{\log p}{n}} \Big).
$$
Similarly,
$$
\|\hat{W} - W\| = \max_{\|x\|_2 = 1} \Big| \sum_{i=1}^{p} (\hat{w}_i - w_i)\, x_i^2 \Big| = \max_{\|x\|_2 = 1} \Big| \sum_{i=1}^{p} \frac{\hat{w}_i^2 - w_i^2}{\hat{w}_i + w_i}\, x_i^2 \Big| \le C_3 \max_{\|x\|_2 = 1} \sum_{i=1}^{p} |\hat{w}_i^2 - w_i^2|\, x_i^2 = C_3\, \|\hat{W}^2 - W^2\|,
$$
where we have used the fact that the true standard deviations are bounded well away from zero, i.e., there exists $0 < C_3 < \infty$ such that $1/C_3 \le w_i^{-1} \le C_3$ for all $i = 1, 2, \ldots, p$, and that the sample standard deviations are all positive, i.e., $\hat{w}_i > 0$ for all $i = 1, 2, \ldots, p$.
Here $\Omega_0$ is the true inverse covariance matrix and $\Omega$ is any other inverse covariance matrix, $A = \operatorname{diag}(A_{11}, A_{22}, \ldots, A_{pp})$, and $\Omega = U D U^T$ and $\Omega_0 = U_0 D_0 U_0^T$ are the eigenvalue decompositions of $\Omega$ and $\Omega_0$ respectively, where $D$ and $D_0$ are the diagonal matrices of eigenvalues and $U$ and $U_0$ are the matrices of eigenvectors. Let $\Delta = \Omega - \Omega_0$ (the difference between any estimate and the true inverse covariance matrix $\Omega_0$). Define the set of symmetric matrices $\Theta(M) = \{\Delta : \Delta = \Delta^T,\ \|\Delta\|_F = M r_n,\ 0 < M < \infty\}$. The estimate $\hat{\Omega}$ minimizes $Q(\Omega)$, or
$$
G(\Delta) \ge \operatorname{tr}(\Delta^2) - 2\, \big|\operatorname{tr}\big(\Delta (S^{-1} - \Omega_0)\big)\big| + \lambda\big( \|\Delta_{H^c}^{-}\|_1 - \|\Delta_{H}^{-}\|_1 \big) - \frac{C_1}{\epsilon_1} \sqrt{\frac{\log p}{n}} \big\{ 2\bar{k}\sqrt{p}\, \|\Delta^{+}\|_F + \|\Delta\|_F^2 + 2\sqrt{p}\, \|\Delta^{+}\|_F \big\},
$$
where $H$ is the index set defined in Assumption B1 and $H^c = \{(i, j) : \Omega_{0ij} = 0,\ i \neq j\}$. Moreover,
$$
\big|\operatorname{tr}\big(\Delta (S^{-1} - \Omega_0)\big)\big| = \big|\operatorname{tr}\big(\Delta\, S^{-1} (\Omega_0^{-1} - S)\, \Omega_0\big)\big| \le \sigma_1(S^{-1})\, \sigma_1(\Omega_0)\, \big|\operatorname{tr}\big(\Delta (S - \Sigma_0)\big)\big| \le \bar{k}^2\, \big|\operatorname{tr}\big(\Delta (S - \Sigma_0)\big)\big|,
$$
by using a result on the trace norm inequality from [31].
Now consider the term $\operatorname{tr}\big(\Delta (S^{-1} - \Omega_0)\big)$. Combining the bound above with (8.1), we obtain
$$
\operatorname{tr}\big(\Delta (S^{-1} - \Omega_0)\big) \le C_1 \Big( \sqrt{\frac{(p + s)\log p}{n}} + \sqrt{\frac{\log p}{n}} \Big) \|\Delta\|_F + C_1 \sqrt{\frac{p \log p}{n}}\, \|\Delta^{+}\|_F.
$$
Therefore,
$$
\begin{aligned}
G(\Delta) &\ge \|\Delta\|_F^2 \Big( 1 - \frac{C_1}{\epsilon_1} \sqrt{\tfrac{\log p}{n}} \Big) - 2 \bar{k}^2 C_1 \sqrt{\tfrac{(p + s)\log p}{n}}\, \|\Delta\|_F - C_1 \sqrt{\tfrac{\log p}{n}} \Big( \frac{2\sqrt{p}(1 + \bar{k})}{\epsilon_1}\, \|\Delta^{+}\|_F + \bar{k}^2 \sqrt{s}\, \|\Delta\|_F + \frac{1 + \bar{k}^2}{\epsilon_1}\, \|\Delta\|_F \Big) \\
&\ge \|\Delta\|_F^2 \Big[ 1 - \frac{C_1}{\epsilon_1} \sqrt{\tfrac{\log p}{n}} - \frac{2 \bar{k}^2 C_1 + 2 C_1 (1 + \bar{k}^2)/\epsilon_1}{M} \Big] \ge 0
\end{aligned}
$$
for all sufficiently large $n$ and $M$. Hence the result.
Proof of Corollary 3.3. The proof of this Corollary is similar to Theorem 3.1 and hence omitted.
Proof of Corollary 3.4. The proof of this Corollary is similar to Corollary 3.2 and hence omitted.
Proof of Theorem 3.4. The proof of this Theorem is similar to Theorem 3.1 and hence omitted.
Proof of Theorem 3.6. The proof of this Theorem is similar to Theorem 3.3 and hence omitted.
8.1. Derivation of the Algorithm.
8.1.1. Covariance matrix estimation. The optimization problem (2.4)
can be written as:
(8.3)
$$
\hat{\Sigma} = \arg\min_{\Sigma = \Sigma^T,\ (\lambda, \gamma) \in \hat{R}_1^{S,t,A,\epsilon}} f(\Sigma), \qquad f(\Sigma) = \|\Sigma - S\|_F^2 + \lambda \|\Sigma^{-}\|_1 + \gamma \sum_{i=1}^{p} a_i \{\sigma_i(\Sigma) - t\}^2.
$$
Note that for a non-negative definite square matrix, the singular values are the same as its eigenvalues, and we have the trace identity: the sum of the eigenvalues of $\Sigma$ equals $\operatorname{tr}(\Sigma)$. Let $\Sigma = U D U^T$, where $D$ is the diagonal matrix of eigenvalues and $U$ is the orthogonal matrix of eigenvectors. We have $\sum_{i=1}^{p} a_i \sigma_i^2(\Sigma) = \sum_{i=1}^{p} a_i D_{ii}^2 = \operatorname{tr}(A D^2)$, where $A = \operatorname{diag}(a_1, a_2, \ldots, a_p)$. Again $D = U^T \Sigma U$ and $D^2 = D^T D = U^T \Sigma^T \Sigma U = U^T \Sigma^2 U$. Therefore
$$
\operatorname{tr}(A D) = \operatorname{tr}(\Sigma\, U A U^T) \quad \text{and} \quad \operatorname{tr}(A D^2) = \operatorname{tr}(A\, U^T \Sigma^2 U) = \operatorname{tr}(\Sigma^2\, U A U^T).
$$
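A quick numerical check of these two identities (a minimal sketch, not part of the derivation):

```python
import numpy as np

rng = np.random.default_rng(0)
p = 6
B = rng.standard_normal((p, p))
Sigma = B @ B.T + p * np.eye(p)              # a positive definite Sigma
a = rng.uniform(0.1, 1.0, size=p)            # arbitrary positive weights
w, U = np.linalg.eigh(Sigma)                 # Sigma = U diag(w) U^T

lhs1, rhs1 = np.sum(a * w), np.trace(Sigma @ U @ np.diag(a) @ U.T)
lhs2, rhs2 = np.sum(a * w**2), np.trace(Sigma @ Sigma @ U @ np.diag(a) @ U.T)
print(np.allclose(lhs1, rhs1), np.allclose(lhs2, rhs2))   # True True
```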
The third term on the right hand side of (8.3) can be written as:
$$
\gamma \sum_{i=1}^{p} a_i \{\sigma_i(\Sigma) - t\}^2 = \gamma \sum_{i=1}^{p} \{ a_i \sigma_i^2(\Sigma) - 2t\, a_i \sigma_i(\Sigma) + a_i t^2 \} = \gamma\, \operatorname{tr}(\Sigma^2\, U A U^T) - 2 \gamma t\, \operatorname{tr}(\Sigma\, U A U^T) + \gamma \sum_{i=1}^{p} a_i t^2.
$$
Therefore,
$$
f(\Sigma) = \|\Sigma - S\|_F^2 + \lambda \|\Sigma^{-}\|_1 + \gamma\, \operatorname{tr}(\Sigma^2\, U A U^T) - 2 \gamma t\, \operatorname{tr}(\Sigma\, U A U^T) + \gamma \sum_{i=1}^{p} a_i t^2 = f_2(\Sigma),
$$
where
$$
f_2(\Sigma) = (1 + \gamma \max_{i \le p} A_{ii}) \Big( \|\Sigma - B C^{-1}\|_F^2 + \{\lambda / (1 + \gamma \max_{i \le p} A_{ii})\}\, \|\Sigma^{-}\|_1 \Big),
$$
with $B = S + \gamma t\, U A U^T$ and $C = I + \gamma\, U A U^T$.
(8.5)
$$
\hat{\Sigma}_{ij} = \operatorname{sign}\big( (B C^{-1})_{ij} \big)\, \Big( |(B C^{-1})_{ij}| - \frac{\lambda}{2(1 + \gamma \max_{i \le p} A_{ii})} \Big)_{+}, \quad \text{for } i \neq j.
$$
Let
$$
M = \tfrac{1}{2}\big( B C^{-1} + (B C^{-1})^T \big).
$$
Combining these two cases, the optimal solution of (8.3) is given by:
(8.6)
$$
\hat{\Sigma}_{ii} = M_{ii}, \qquad \hat{\Sigma}_{ij} = \operatorname{sign}(M_{ij})\, \max\Big( |M_{ij}| - \frac{\lambda}{2(1 + \gamma \max_{i \le p} A_{ii})},\ 0 \Big), \quad i \neq j.
$$
eigenvalues towards a fixed constant (i.e., the same amount of shrinkage for each of the sample eigenvalues), the smaller eigenvalues are shrunk upward heavily, away from the true eigenvalues. Therefore we choose non-uniform weights for the eigenvalues to avoid over-shrinkage. Note that given a priori knowledge of the eigenvalue dispersion, one might be able to find better weights. Here we do not assume knowledge of any structure among the eigenvalues and choose the weights by the following scheme (all eigenvalues are assumed ordered in decreasing order of magnitude; a code sketch follows the remarks below):
i) Let $t$ be the average of the sample eigenvalues. Let $k$ be the index such that the $k$-th ordered eigenvalue is less than $t$. Let $r = p/n$ and $b_1 = \max(\operatorname{diag}(S))\,(1 + \sqrt{p/n})^2$.
ii) For $j = 1$ to $p$: $c_j = b_j\,(1 + 0.005\,\log(1 + r))^{|j - k|}$ and $b_{j+1} = b_j^2 / c_j$.
iii) Set $A = \operatorname{diag}(a_1, a_2, \ldots, a_p)$, where $a_j = c_j / \sum_{j=1}^{p} c_j$.
Here $|x|$ is the absolute value of $x$. Such a choice of weights allows more shrinkage of the extreme sample eigenvalues than of the ones in the center of the eigen-spectrum. The logarithmic term was chosen to scale the weights; this is an arbitrary choice which has worked in our simulation setting.
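A minimal sketch of the weighting scheme i)-iii); the 0-based indexing and the fallback choice of $k$ when no eigenvalue falls below $t$ are assumptions made for illustration:

```python
import numpy as np

def eigenvalue_weights(S, n):
    """Compute a_1, ..., a_p following the scheme above."""
    p = S.shape[0]
    eig = np.sort(np.linalg.eigvalsh(S))[::-1]          # decreasing order
    t = eig.mean()
    below = np.where(eig < t)[0]
    k = below[0] if below.size else p - 1               # first eigenvalue below t
    r = p / n
    b = np.diag(S).max() * (1 + np.sqrt(p / n)) ** 2    # b_1
    c = np.empty(p)
    for j in range(p):
        c[j] = b * (1 + 0.005 * np.log(1 + r)) ** abs(j - k)
        b = b ** 2 / c[j]                               # b_{j+1} = b_j^2 / c_j
    return c / c.sum()                                  # a_j = c_j / sum(c)
```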
Figure 8.1 shows the heatmap of zero recovery (sparsity) for block and Toeplitz type covariance matrices based on 50 realizations for $n = 50$ and $p = 50$. The JPEN estimate of the covariance matrix recovers about 80% of the true zeros for the Toeplitz and block type covariance matrices. Our proposed estimator also reflects the recovery of the true structure of the non-zero entries and of any pattern among the rows/columns of the covariance matrix.
Fig 8.1. Heatmap of zeros identified in covariance matrix out of 50 realizations. Whitish
grid is 50/50 zeros identified, blackish grid is 0/50 zeros identified.
Table 8.1 gives the average relative errors and standard errors of the covariance matrix estimates based on glasso, Ledoit-Wolf, PDSCE, and JPEN for $n = 100$ and $p = 500, 1000$. The glasso estimate of the covariance matrix performs very poorly among all the methods. The Ledoit-Wolf estimate performs well, but the estimate is generally not sparse; also, its eigenvalue estimates are shrunk much more heavily towards the center than the true eigenvalues. The JPEN estimator outperforms the other estimators for most of the values of $p$ for all four types of covariance matrices. The PDSCE estimates have low average relative error, close to JPEN; this could be due to the fact that PDSCE and JPEN both use a quadratic loss, with a different penalty function. Table 8.2 reports the average relative errors and their standard deviations for inverse covariance matrix estimation. Here we do not include the Ledoit-Wolf estimator and only compare the glasso and PDSCE estimates with the proposed JPEN estimator. The JPEN estimate of the inverse covariance matrix outperforms the other methods for $p = 500$ and $p = 1000$ for all four types of structured inverse covariance matrices.
8.2.2. Covariance Matrix Estimation.
[Table 8.1: covariance matrix estimation with n = 100: average relative errors (with standard errors) for Ledoit-Wolf, Glasso, PDSCE, and JPEN at p = 500 and p = 1000, for block type, Toeplitz type, and the other structured covariance matrices.]
Ashwini Maurya
Department of Statistics
and Probability
Michigan State University
East Lansing, MI 48824-1027
U. S. A.
E-mail: mauryaas@stt.msu.edu