
Article

Low Dimensionality in Gene Expression Data Enables the Accurate Extraction of Transcriptional Programs from Shallow Sequencing
Graphical Abstract

Authors
Graham Heimberg, Rajat Bhatnagar, Hana El-Samad, Matt Thomson

Correspondence
hana.el-samad@ucsf.edu (H.E.-S.), matthew.thomson@ucsf.edu (M.T.)

In Brief
We develop a mathematical framework that delineates how parameters such as read depth and sample number influence the error in transcriptional program extraction from mRNA-sequencing data. Our analyses reveal that gene expression modularity facilitates low error at surprisingly low read depths, arguing that increased multiplexing of shallow sequencing experiments is a viable approach for applications such as single-cell profiling of entire tumors.

Highlights
• Mathematical model reveals impact of mRNA-seq read depth on gene expression analysis
• Modularity in gene expression facilitates robust transcriptional program extraction
• Model suggests dramatic increases in sample multiplexing for many applications
• Read depth calculator determines parameters for optimal experimental design

Heimberg et al., 2016, Cell Systems 2, 239-250
April 27, 2016 © 2016 The Authors
http://dx.doi.org/10.1016/j.cels.2016.04.001
Cell Systems

Article

Low Dimensionality in Gene Expression Data Enables the Accurate Extraction of Transcriptional Programs from Shallow Sequencing
Graham Heimberg,1,2,3,4,5 Rajat Bhatnagar,1,3,5 Hana El-Samad,1,3,* and Matt Thomson3,4,*
1Department of Biochemistry and Biophysics, California Institute for Quantitative Biosciences, University of California, San Francisco, San Francisco, CA 94158, USA


2Integrative Program in Quantitative Biology, University of California, San Francisco, San Francisco, CA 94158, USA
3Center for Systems and Synthetic Biology, University of California, San Francisco, San Francisco, CA 94158, USA
4Department of Cellular and Molecular Pharmacology, University of California, San Francisco, San Francisco, CA 94158, USA
5Co-first author

*Correspondence: hana.el-samad@ucsf.edu (H.E.-S.), matthew.thomson@ucsf.edu (M.T.)


http://dx.doi.org/10.1016/j.cels.2016.04.001

SUMMARY

A tradeoff between precision and throughput constrains all biological measurements, including sequencing-based technologies. Here, we develop a mathematical framework that defines this tradeoff between mRNA-sequencing depth and error in the extraction of biological information. We find that transcriptional programs can be reproducibly identified at ~1% of conventional read depths. We demonstrate that this resilience to noise of shallow sequencing derives from a natural property, low dimensionality, which is a fundamental feature of gene expression data. Accordingly, our conclusions hold for ~350 single-cell and bulk gene expression datasets across yeast, mouse, and human. In total, our approach provides quantitative guidelines for the choice of sequencing depth necessary to achieve a desired level of analytical resolution. We codify these guidelines in an open-source read depth calculator. This work demonstrates that the structure inherent in biological networks can be productively exploited to increase measurement throughput, an idea that is now common in many branches of science, such as image processing.

INTRODUCTION

All measurements, including biological measurements, contain a tradeoff between precision and throughput. In sequencing-based measurements like mRNA-sequencing (mRNA-seq), precision is determined largely by the sequencing depth applied to individual samples. At high sequencing depth, mRNA-seq can detect subtle changes in gene expression, including the expression of rare splice variants or quantitative modulations in transcript abundance. However, such precision comes at a cost, and sequencing transcripts from 10,000 single cells at deep sequencing coverage (~10⁶ reads per cell) currently requires ~2 weeks of sequencing on an Illumina HiSeq 4000.

Not all biological questions require such extreme technical sensitivity. For example, a catalog of human cell types and the transcriptional programs that define them can potentially be generated by querying the general transcriptional state of single cells (Trapnell, 2015). In principle, theoretical and computational methods could elucidate the tradeoff between sequencing depth and the granularity of the information that can be accurately extracted from samples. Accordingly, optimizing this tradeoff based on the granularity required by the biological question at hand would yield significant increases in the scale at which mRNA-seq can be applied, facilitating applications such as drug screening and whole-organ or tumor profiling.

The modern engineering discipline of signal processing has demonstrated that structural properties of natural signals can often be exploited to enable new classes of low-cost measurements. The central insight is that many natural signals are effectively low dimensional. Geometrically, this means that these signals lie on a noisy, low-dimensional manifold embedded in the observed, high-dimensional measurement space. Equivalently, this property indicates that there is a basis representation in which these signals can be accurately captured by a small number of basis vectors relative to the original measurement dimension (Donoho, 2006; Candes et al., 2006; Hinton and Salakhutdinov, 2006). Modern algorithms exploit the fact that the number of measurements required to reconstruct a low-dimensional signal can be far fewer than the apparent number of degrees of freedom. For example, in images of natural scenes, correlations between neighboring pixels induce an effective low dimensionality that allows high-accuracy image reconstruction even in the presence of considerable measurement noise such as point defects in many camera pixels (Duarte et al., 2008).

Like natural images, it has long been appreciated that biological systems contain structural features that can lead to an effective low dimensionality in data. Most notably, genes are commonly co-regulated within transcriptional modules; this produces covariation in the expression of many genes (Eisen et al., 1998; Segal et al., 2003; Bergmann et al., 2003). The widespread presence of such modules indicates that the natural dimensionality of gene expression is determined not by the number of genes in the genome but by the number of regulatory modules.

Figure 1. A Mathematical Model Reveals Factors Determining the Performance of Shallow mRNA-Seq
(A) mRNA-seq throughput as a function of sequencing depth per sample for a fixed sequencing capacity.
(B) Unsupervised learning techniques are used to identify transcriptional programs. We ask when and why shallow mRNA-seq can accurately identify transcriptional programs.
(C) Decreasing sequencing depth adds measurement noise to the transcriptional programs identified by unsupervised learning. Our approach reveals that dominant programs, defined as those that explain relatively large variances in the data, are tolerant to measurement noise.

By analogy to signal processing, this natural structure suggests that the lower effective dimensionality present in gene expression data can be exploited to make accurate, inexpensive measurements that are not degraded by noise. But when, and at what error tradeoff, can low dimensionality be leveraged to enable low-cost, high-information-content biological measurements?

Here, inspired by these developments in signal processing, we establish a mathematical framework that addresses the impact of reducing coverage depth, and hence increasing measurement noise, on the reconstruction of transcriptional regulatory programs from mRNA-seq data. Our framework reveals that shallow mRNA-seq, which has been proposed to increase mRNA-seq throughput by reducing sequencing depth in individual samples (Jaitin et al., 2014; Pollen et al., 2014; Kliebenstein, 2012) (Figure 1A), can be applied generally to many bulk and single-cell mRNA-seq experiments. By investigating the fundamental limits of shallow mRNA-seq, we define the conditions under which it has utility and complements deep sequencing.

Our analysis reveals that the dominance of a transcriptional program, quantified by the fraction of the variance it explains in the dataset, determines the read depth required to accurately extract it. We demonstrate that common bioinformatic analyses can be performed at ~1% of traditional sequencing depths with little loss in inferred biological information at the level of transcriptional programs. We also introduce a simple read depth calculator that determines optimal experimental parameters to achieve a desired analytical accuracy. Our framework and computational results highlight the effective low dimensionality of gene expression, commonly caused by co-regulation of genes, as both a fundamental feature of biological data and a major underpinning of biological signals' tolerance to measurement noise (Figures 1B and 1C). Understanding the fundamental limits and tradeoffs involved in extracting information from mRNA-seq data will guide researchers in designing large-scale bulk mRNA-seq experiments and analyzing single-cell data where transcript coverage is inherently low.



RESULTS

Statistical Properties of Gene Expression Data Determine the Accuracy of Principal Component Analysis at Low Read Depth

To delineate the impact of sequencing depth on the analysis of mRNA-seq data, we developed a mathematical framework that models the performance of a common bioinformatics technique, transcriptional program identification, at low sequencing depth. We focus on transcriptional program identification as it is central in many analyses including gene set analysis, network reconstruction (Holter et al., 2001; Bonneau, 2008), and cancer classification (Alon et al., 1999; Shai et al., 2003; Patel et al., 2014), as well as the analysis of single-cell mRNA-seq data. Our model defines exactly how reductions in read depth corrupt the extracted transcriptional programs and determines the precise depth required to recover them with a desired accuracy.

Our analysis focuses on the identification of transcriptional programs from mRNA-seq data through principal component analysis (PCA), because of its prevalence in gene expression analysis (Alter et al., 2000; Ringner, 2008) and its fundamental similarities to other commonly used methods. A recent review called PCA the most widely used method for unsupervised clustering and noted that it has already been successfully applied in many single-cell genomics contexts (Trapnell, 2015). Additionally, research in the computer science community over the last decade has shown that many other unsupervised learning methods, including k-means, spectral clustering, and Locally Linear Embedding, are naturally related to PCA or its generalization, Kernel PCA (Ding and He, 2004; Ng et al., 2001; Ham et al., 2004; Bengio et al., 2004). Because of the deep connection between PCA and other unsupervised learning techniques, we expect that our conclusions in this section will extend to other methods of analysis (and we provide such parallel analysis in the Supplemental Information). Here, we focus on PCA because the well-defined theory behind it provides a unique opportunity to understand, analytically, the factors that determine the robustness of program identification to low-coverage sequencing noise.

PCA identifies transcriptional programs by extracting groups of genes that covary across a set of samples. Covarying genes are grouped into a gene expression vector known as a principal component. Principal components are weighted by their relative importance in capturing the gene expression variation that occurs in the underlying data. Decreasing sequencing depth introduces measurement noise into the gene expression data and corrupts the extracted principal components.

If the transcriptional programs obtained from shallow mRNA-seq data and deep mRNA-seq data are similar, then we can accurately perform many gene expression analyses at low depth while collecting data in much higher throughput (Figure 1). We therefore developed a mathematical model that quantifies how the principal components computed at low and high sequencing depths differ. The model reveals that performance of transcriptional program extraction at low read depth is specific to the dataset and even the program itself. It is the dominant transcriptional programs, which capture most variance, that are the most stable.

Formally, the principal components are defined as the eigenvectors of the gene expression covariance matrix, and the principal values λᵢ are the associated eigenvalues that equal the variance of the data projected onto the component (Alter et al., 2000; Holter et al., 2001). We use perturbation theory to model how the eigenvectors of the gene expression covariance matrix change when measurement noise is added (Stewart and Sun, 1990; Shankar, 2012). We perform our analysis in units of normalized read counts for conceptual clarity (or normalized transcript counts where appropriate), but an identical analysis and error equation can be derived in FPKM units through a simple rescaling. The principal component error is defined as the deviation between the deep (pcᵢ) and shallow (p̂cᵢ) principal components,

$$\left\lVert pc_i - \widehat{pc}_i \right\rVert \approx \sqrt{\sum_{j \neq i} \left( \frac{pc_i^{T} \left( \widehat{C} - C \right) pc_j}{\lambda_i - \lambda_j} \right)^{2}} \qquad \text{(Equation 1)}$$

where C and Ĉ are the covariance matrices obtained from deep and shallow mRNA-seq data, respectively. Equation 1 can be used to model the impact of shallow sequencing on any given mRNA-seq dataset. Moreover, qualitative analysis of the equation reveals the key factors that determine whether low-depth profiling will accurately identify transcriptional programs. As expected, this equation indicates that the principal component error depends on generic features including read depth and sample number, as these affect the difference between the shallow and deep covariance matrices in the numerator of Equation 1 (see the Supplemental Information, section 2.1). However, Equation 1 also reveals that the principal component error depends on a system-specific property: the relative magnitude of the principal values (captured by λᵢ - λⱼ). Since the principal values correspond to the variance in the data along a principal component, this term quantifies whether the information in the gene expression data is concentrated among a few transcriptional programs. When genes covary along a small number of principal axes, the dataset has an effective low dimensionality, i.e., the data are concentrated on a low-dimensional subspace, and transcriptional programs can be extracted even in the presence of sequencing noise.
Mouse Tissues Can Be Distinguished at Low Depth in Bulk mRNA-Seq Samples

To understand the implications of this result in the context of an established mRNA-seq dataset, we applied Equation 1 to a subset of the mouse ENCODE data, which uses deep mRNA-seq (>10⁷ reads per sample) to profile gene expression of 19 different mouse tissues with a biological replicate (Shen et al., 2012) (see the Experimental Procedures). The analysis revealed that the leading, dominant transcriptional programs could be extracted with <1% of the study's original read depth. Specifically, the first three principal components could be recovered with >80% accuracy (i.e., an error of 1 - 0.8 = 20%) with just 55,000 reads per experiment (Figures 2A and S1A). To reach 80% accuracy for all of the first nine principal components, only 145,000 reads were needed (Figure S1B). Increasing read depth further had diminishing returns for principal component accuracy. To increase the accuracy of the first three principal components by an additional 5% (from 80% to 85%), 55% more reads were required.



Figure 2. Transcriptional States of Mouse Tissues Are Distinguishable at Low Read Coverage
(A) Principal component error as a function of read depth for selected principal components for the Shen et al. (2012) data. For the first three principal components, ~1% of the traditional read depth is sufficient for achieving >80% accuracy. Improvements in error exhibit diminishing returns as read depth is increased. Less dominant transcriptional programs (principal components 8 and 15 shown) are more sensitive to sequencing noise.
(B) Variance explained by transcriptional program (blue) and differences between principal values (green) of the Shen et al. (2012) data. The leading, dominant transcriptional programs have principal values that are well separated from later principal values, suggesting that these should be more robust to measurement noise.
(C) GSEA significance for the top ten terms of principal components two (top) and three (bottom) as a function of read depth. 32,000 reads are sufficient to recover all top ten terms in the first three principal components. (Analysis for the first principal component shown in Figure S1C.)
(D) Projection of a subset of the Shen et al. (2012) tissue data onto principal components two and three. The ellipses represent uncertainty at specific read depths. Similar tissues lie close together. Transcriptional program two separates neural tissues from non-neural tissues, while transcriptional program three distinguishes tissues involved in hematopoiesis from other tissues. This is consistent with the GSEA of these transcriptional programs in (C).

We confirmed these analytical results by simulating shallow mRNA-seq through direct sub-sampling of reads from the raw dataset (see the Experimental Procedures).

Further, as predicted by Equation 1, the dominant principal components were more robust to shallow sequencing noise than the trailing, minor principal components. This is a direct consequence of the fact that the leading principal values are well separated from other principal values, while the trailing values are spaced closely together. For instance, λ₁ is separated from other principal values by at least λ₁ - λ₂ = 5 × 10⁻⁶, more than two orders of magnitude greater than the minimum separation of λ₂₅ from other principal values (1.5 × 10⁻⁸) (Figure 2B). Therefore, the 25th principal component requires almost four million reads, 140 times more than the first principal component, to be recovered with the same 80% accuracy.
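The principal value separations quoted above can be read directly off the eigenvalue spectrum of the data. A brief sketch, with `X` a hypothetical samples-by-genes matrix of normalized read counts (the SVD route avoids forming a genes-by-genes covariance matrix):

```python
# Sketch: minimum separation of each principal value from all others, the
# gap that controls noise sensitivity in Equation 1. Assumes at least 26
# samples so that lambda_25 exists.
import numpy as np

Xc = X - X.mean(axis=0)                     # center each gene
s = np.linalg.svd(Xc, compute_uv=False)     # singular values of centered data
lam = s**2 / (X.shape[0] - 1)               # principal values (variances)
gaps = [min(abs(lam[i] - lam[j]) for j in range(lam.size) if j != i)
        for i in range(lam.size)]
print("min separation of lambda_1:", gaps[0])
print("min separation of lambda_25:", gaps[24])
```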
To explore whether the shallow principal components also retained the same biological information as the programs computed from deep mRNA-seq data, we compared results from Gene Set Enrichment Analysis applied to shallow and deep mRNA-seq data. At a read depth of 10⁷ reads per sample, the first three principal components have many significant functional enrichments, with the second and third principal components enriched for neural and hematopoietic processes, respectively (Figure 2C; see Figure S1C for the first principal component). These functional enrichments corroborate the separation seen when the gene expression profiles from each tissue are projected onto the second and third principal components (see the Experimental Procedures). Neural tissues (cerebellum, cortex, olfactory, and embryonic day 14.5 [E14.5] brain) project along the second principal component, while the hematopoietic tissues (spleen, liver, thymus, bone marrow, and E14.5 liver) project along the third principal component (Figure 2D).



The statistically significant enrichments of the first three principal components persisted at low sequencing depths. At <32,000 reads per sample, only 0.37% of the total reads, all ten of the top gene sets for these principal components passed our significance threshold of p < 10⁻⁴ (negative predictive value and positive predictive value in Figures S1D and S1E). To put this result in perspective, using only 32,000 reads per sample (corresponding to PCA accuracies of 81%, 79%, and 75% for the first three principal components, respectively) would allow a faithful recapitulation of functional enrichments while still multiplexing thousands of samples, rather than dozens, in a single Illumina HiSeq sequencing lane. Additionally, this low number of reads was still sufficient to separate the different cell types (Figure 2D). We obtained similar results when working in FPKM units, suggesting that the broad conclusions of our analysis are insensitive to gene expression units (Figures S1F, S1G, and S1H).

Transcriptional States in Single Cells Are Distinguishable with Less Than 1,000 Transcripts per Cell

We wanted to explore whether shallow mRNA-seq could also capture gene expression differences between individual single cells within a heterogeneous tissue, arguably a more challenging problem than distinguishing different bulk tissue samples. In addition to the biological importance of quantifying variability at the single-cell level, single-cell mRNA-seq data provide the necessary context for analyzing the performance of shallow sequencing for two reasons. First, single-cell mRNA-seq experiments are inherently low-depth measurements, as current methods can capture only a small fraction (~20%) (Shalek et al., 2014) of the ~300,000 transcripts (Velculescu et al., 1999) typically contained in individual cells. Second, since advances in microfluidics (Macosko et al., 2015) now facilitate the automated preparation of tens of thousands of individual cells for single-cell mRNA-seq, sequencing requirements impose a key bottleneck on the further scaling of single-cell throughput.

To probe the impact of sequencing depth reductions on single-cell mRNA-seq data, we analyzed a dataset characterizing 3,005 single cells from the mouse cerebral cortex and hippocampus (Zeisel et al., 2015) that were classified bioinformatically at full sequencing depth (an average of ~15,000 unique transcripts per cell) into nine different neural and non-neural cell types. In addition to providing a rich biological context for analysis, this dataset allows for a quantitative analysis of low-depth transcriptional profiling, as it incorporates molecular barcodes known as unique molecular identifiers (UMIs) that enable the precise counting of transcripts from each single cell. The Zeisel et al. (2015) data therefore allowed us to analyze the impact of sequencing depth reductions quantitatively in units of transcript counts rather than in the less precise unit of raw sequencing reads.

Similarly to the bulk tissue data, we found that leading principal components in single cells could be reconstructed with a small fraction of the total transcripts collected in the raw dataset. We focused our analysis on three classes of cell types that are transcriptionally distinct: two classes of pyramidal neurons with similar gene expression profiles, and oligodendrocytes. As the first three principal values were well separated from the others (Figure S2A), Equation 1 estimated that the first three principal components could be reconstructed with 11%, 22%, and 38% error, respectively, with just 1,000 transcripts per cell (Figure 3A).

We confirmed this result computationally. With just 100 unique transcripts, we were able to separate oligodendrocytes from the two classes of pyramidal neurons with >90% accuracy. With 1,000 unique transcripts per cell, we were able to distinguish pyramidal neurons of the hippocampus from those of the cortex with the same >90% accuracy (Figure 3B). The different depths required to distinguish these subclasses of neural and non-neural cell types reflect the differing robustness of the corresponding principal components. The first principal component captures a broad distinction between oligodendrocytes and pyramidal cell types (Figure 3C, left) and is the most robust to low read depths. The third principal component captures a more fine-grained distinction between pyramidal neurons but is less robust than the first principal component at low read depth and hence requires more coverage. This is consistent with biological intuition: more depth is required to distinguish between pyramidal neural subtypes than between oligodendrocytes and pyramidal neurons.

We next asked how contributions of individual genes to a principal component change as a function of read depth. For every principal component, we derived a null model consisting of the distribution of the individual gene weightings, called loadings, from a shuffled version of the data (see the Experimental Procedures). Comparing the data to the null model, we found that at a depth of ~340 transcripts, >80% of genes significantly associated with the first principal component could still be detected (Figures 3C and 3D; Experimental Procedures). At just 100 transcripts per cell, we were still able to identify oligodendrocyte markers, such as myelin-associated oligodendrocyte basic protein (Mobp) and myelin-associated glycoprotein (Mag), as well as neural markers, such as Neuronal differentiation 6 (Neurod6) and Neurogranin (Nrgn), as statistically significant, and to reliably classify these distinct cell types. However, below 100 transcripts per cell, cell-type classification becomes inaccurate, and this is correlated with markers such as Neurod6 no longer being statistically associated with the first principal component.

We were able to reach similar conclusions with three other single-cell mRNA-seq datasets (Shalek et al., 2013; Treutlein et al., 2014; Kumar et al., 2014). With similarly low sequencing depths, we were able to distinguish transcriptional states of single cells collected across stages of the developing mouse lung (Figures S2B-S2D), wild-type mouse embryonic stem cells from stem cells with a single gene knockout (Figures S2E-S2G), and heterogeneity within a population of bone-marrow-derived dendritic cells (Figures S2H-S2J). These results were also not PCA-specific. We additionally examined two of these datasets with t-distributed Stochastic Neighbor Embedding (t-SNE) and Locally Linear Embedding (LLE), two nonlinear alternatives to PCA (Van der Maaten and Hinton, 2008; Roweis and Saul, 2000), and achieved successful classification of transcriptional states (Figures S2K and S2L), in each case recapitulating the results of the original studies with fewer than 5,000 reads per cell. These results suggest that low dimensionality enables high-accuracy classification at low read depth across many methods.
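A hedged sketch of the classification test described above: project down-sampled single-cell profiles onto the leading PCs and score cell-type recovery. Here `shallow` (a cells-by-genes matrix down-sampled to ~1,000 transcripts per cell) and `cell_types` are hypothetical inputs, and the log transform and k-nearest-neighbors classifier are our illustrative choices, not necessarily those of the original study.

```python
# Score how well cell types separate in the space of the leading principal
# components after down-sampling (assumed inputs: shallow, cell_types).
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

pcs = PCA(n_components=3).fit_transform(np.log1p(shallow))
acc = cross_val_score(KNeighborsClassifier(n_neighbors=15), pcs, cell_types, cv=5)
print(f"mean cell-type classification accuracy: {acc.mean():.2f}")
```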



Figure 3. Transcriptional States of Single Cells in the Mouse Brain Are Distinguishable at Low Transcript Coverage
(A) Principal component error as a function of read depth for selected principal components for the Zeisel et al. (2015) data.
(B) Accuracy of cell type classification as a function of transcripts per cell. Accuracy plateaus with increasing transcript coverage. At 1,000 transcripts per cell, all
three cell types can be distinguished with low error. At 100 transcripts per cell, pyramidal cells cannot be distinguished from each other, while oligodendrocytes
remain distinct.
(C) Covariance matrix of genes with high absolute loadings in the first principal component (left). The genes with the 100 highest positive and 100 lowest negative
loadings are displayed. The first principal component is enriched for genes indicative of oligodendrocytes and neurons (middle). Gene significance as a function of
transcript count for the first principal component (right).
(D) True and false detection rates as a function of transcript count for genes significantly associated with the first three principal components. Below 100
transcripts per cell, false positives are common.

Gene Expression Covariance Induces Tolerance to Shallow Sequencing Noise

In the datasets we considered, the dominant noise-robust principal components corresponded directly to large modules of covarying genes. Such modules are common in gene expression data (Eisen et al., 1998; Alter et al., 2000; Bergmann et al., 2003; Segal et al., 2003). We therefore studied the contribution of modularity to principal component robustness in a simple mathematical model of gene expression (Supplemental Information, section 2.2). Our analysis showed that the variance explained by a principal component, and hence its noise tolerance, increases with the covariance of genes within the associated module (Figure 4A) and also with the number of genes in the module (Figures S3A-S3C). While highly expressed genes also contribute to noise tolerance, in the Shen et al. (2012) dataset we found little correlation between the expression level of a gene and its contribution to the error of the first principal component (R² = 0.13; Figure S3D).

This analysis predicts that the large groups of tightly covarying genes observed in the Shen et al. (2012) and Zeisel et al. (2015) datasets will contribute significantly to principal value separation and noise tolerance. To directly quantify the contribution of covariance to principal value separation in these data, we randomly shuffled the sample labels for each gene. In the shuffled data, genes vary independently, which eliminates gene-gene covariance and raises the effective dimensionality of the data.
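The shuffling control is simple to reproduce in code. A minimal sketch, assuming `X` is a hypothetical samples-by-genes expression matrix: permuting the sample labels independently for each gene preserves every gene's marginal distribution while destroying gene-gene covariance.

```python
# Compare principal value spectra before and after per-gene shuffling.
import numpy as np

rng = np.random.default_rng(0)

def principal_values(M):
    """Variances along the principal components, via SVD of centered data."""
    Mc = M - M.mean(axis=0)
    s = np.linalg.svd(Mc, compute_uv=False)
    return s**2 / (M.shape[0] - 1)

# Independently permute each gene's column: marginals kept, covariance lost.
X_shuf = np.column_stack([rng.permutation(X[:, g]) for g in range(X.shape[1])])

print("top-5 principal values, original:", principal_values(X)[:5])
print("top-5 principal values, shuffled:", principal_values(X_shuf)[:5])
```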



Figure 4. Modularity of Gene Expression Enables Accurate, Low-Depth Transcriptional Program Identification
(A) Variance explained and covariance matrix for increasing gene expression covariance in a model.
(B) Variance explained by different principal components for the Zeisel et al. (2015) dataset (top). Covariance matrix shows large modules of covarying genes (middle). Dominant transcriptional programs are robust to low-coverage profiling, as predicted by the model (bottom). Shuffling the dataset destroys the modular structure, resulting in noise-sensitive transcriptional programs. For the shuffled data, 4,250 transcripts are required for 80% accuracy of the first three principal components, whereas 340 transcripts suffice for the original dataset.

In contrast to the natural, low-dimensional data, the principal values of the resulting data were nearly uniform in magnitude. This significantly diminished the differences between the leading principal values within the shuffled data (Figure 4B, top).

Consequently, reconstruction of the principal components became more read-depth intensive. For instance, to recover the first principal component with 80% accuracy from the shuffled Zeisel et al. (2015) data, 12.5 times more transcripts are required than for the unshuffled data (Figure 4B, bottom). We reached a similar conclusion for the mouse ENCODE data, where shuffling also decreased the differences between the leading principal values and the rest, causing a 23-fold increase in the sequencing depth required to recover the first principal component with 90% accuracy (Figure S4).

Large-Scale Survey Reveals that Shallow mRNA-Seq Is Widely Applicable due to Gene-Gene Covariance

Both our analysis of Equation 1 and our computational investigations of mRNA-seq datasets suggest that high gene-gene covariances increase the distance of the leading principal values from the rest, thereby enabling the recovery of dominant principal components at low mRNA-seq read depths. This finding, if a common phenomenon, suggests that shallow mRNA-seq may be rigorously employed when answering many biological questions. To assess whether our findings are broadly applicable, we performed a broad computational survey of available gene expression data.

Since both gene covariances and principal values are fundamental properties of the biological systems under study, these quantities may be analyzed using the wealth of microarray datasets available, leveraging a larger collection of gene expression datasets as compared to mRNA-seq (see Figure S5A for analyses of several mRNA-seq datasets). We selected 352 gene expression datasets from the GEO (Edgar et al., 2002) spanning three species (yeast, 20 datasets; mouse, 106 datasets; and human, 226 datasets) that each contained at least 20 samples and were performed on the Affymetrix platform.

Despite the differences between these datasets in terms of species and collection conditions, they all possessed favorable principal value distributions reflecting an effective low dimensionality. For instance, on average the first principal value was roughly twice as large as the second principal value, and together the first five principal values explained a significant majority of the variance, suggesting that these datasets contain a few, dominant principal components (Figure 5A, left). By shuffling these datasets to reorder the sample labels for each gene, we again found that these principal components emerge from gene-gene covariance.



Figure 5. Gene Expression Survey of 352 Public Datasets Reveals Broad Tolerance of Bioinformatics Analysis to Shallow Profiling
(A) Variance explained by the first five transcriptional programs of 352 published yeast, mouse, and human microarray datasets (left). Shuffling microarray datasets removes gene-gene covariance and destroys the relative dominance of the leading transcriptional programs. Read depth required to recover the first five principal components of the 352 datasets with 80% accuracy (right). Removing gene expression covariance from the data requires a median of approximately ten times more reads to achieve the same accuracy.
(B) Accuracy of GSEA of the human microarray datasets at low read depth (100,000 reads, i.e., ~1% of deep depth). Reactome pathway database gene sets are correctly identified (blue) or not identified (yellow) at low read depth (false positives in red). ~80% of gene sets can be correctly recovered at 100,000 reads.
(C) Accuracy of GSEA as a function of read depth.

We related this pattern of dominant principal components to the ability to recover biological information with shallow mRNA-seq in these datasets. To generate synthetic mRNA-seq data from these microarray datasets, we applied a probabilistic model to simulate mRNA-seq at a given read depth (see the Experimental Procedures). We found that with only 60,000 reads per sample, 84% of the 352 datasets have ≤20% error in their first principal component. This translates into an average of almost 1,000% read depth savings to recover the first principal component with an acceptable PCA error tolerance of 20% (Figure 5A, right). By applying gene set enrichment analysis (GSEA) to the first principal component of each of the 352 datasets at low (100,000 reads per sample) and high read depths (10 million reads per sample), we found that >60% of gene set enrichments were retained with only 1% of the reads (Figures 5B and 5C). This analysis demonstrates that biological information was also retained at low depth.

Collectively, our analyses demonstrate that the success of low-coverage sequencing relies on a few dominant transcriptional programs. We also show that many gene expression datasets contain such noise-resistant programs, as determined by PCA, and identified them with the dominant dimensions in the dataset. Furthermore, low dimensionality and noise robustness are properties of the gene expression datasets themselves and exist independent of the choice of analysis technique. Therefore, unsupervised learning methods other than PCA would reach similar conclusions, an expectation we verified using non-negative matrix factorization (Figure S5B).



The Read Depth Calculator: A Quantitative Framework for Selecting Optimal mRNA-Seq Read Depth and Number of Biological Samples

Because the optimal choice of read depth in an mRNA-seq experiment is of widespread practical relevance, we developed a read depth calculator that can provide quantitative guidelines for shallow mRNA-seq experimental design. Having pinpointed the factors that determine the applicability of shallow mRNA-seq, we applied this understanding to determine the read depth and number of biological samples to profile when designing an experiment. To do so, we simplified the principal component error described by Equation 1 by assuming that the principal values of mRNA-seq data are well separated, i.e., that the ratio between consecutive principal values λᵢ₊₁/λᵢ is small (as defined in the Supplemental Information, section 2.1), an assumption justified by our large-scale microarray survey (see Figures S5C and S5D). These assumptions enable us to provide simple guidelines for making important experimental decisions, for example, choosing the read depth, N:

$$N \approx \frac{k^2}{n \, \lambda_i^2 \, \left\lVert pc_i - \widehat{pc}_i \right\rVert^2} \qquad \text{(Equation 2)}$$

where n is the number of biological samples and k is a constant that can be estimated from existing data (see the Supplemental Information, section 2.1 for a derivation of this equation and its limitations). This relationship can be understood intuitively. First, Equation 2 states that the principal component error decreases with read depth, a consequence of the well-known fact that the signal-to-noise ratio of a Poisson random variable is proportional to √N. The read depth also depends on λᵢ, which comes from the λᵢ - λⱼ term of Equation 1. Finally, the influence of the sample number n on read depth follows from the definition of covariance as an average over samples. (Figure S5E shows that n is approximately statistically uncorrelated with principal values across the microarray datasets.)
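A direct transcription of Equation 2 into code is shown below. This is a sketch: the constant k must be fit empirically to existing data (as in the cross-validation described later in this section), and the argument values in the usage comment are the illustrative figures from the text.

```python
# Sketch of the read depth calculator (Equation 2). The constant k is an
# empirical fit; pc_error is the tolerable principal component error
# (e.g., 0.2 corresponds to 80% accuracy).
def reads_required(k, n_samples, principal_value, pc_error):
    """Reads per sample needed to reach the target principal component error."""
    return k**2 / (n_samples * principal_value**2 * pc_error**2)

# Hypothetical use with the median human first principal value from the text:
# reads_required(k_fit, n_samples=100, principal_value=1.4e-5, pc_error=0.2)
```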
Equation 2 has implications for optimizing the tradeoff between read depth and sample number in single-cell mRNA-seq experiments. As principal component error depends on the product of read depth and number of samples, error in mRNA-seq analyses can be reduced equivalently in two ways, by either increasing the total number of profiled cells or the transcript coverage. To illustrate this point, we computationally determined the error in the first principal component of the single-cell mouse brain data from Zeisel et al. (2015) as a function of cell number. Consistent with Equation 2, our calculations show that increasing the number of profiled cells reduces error in the first principal component (Figure 6A). Furthermore, we show that with the Zeisel et al. (2015) data, multiple different experimental configurations with the same total number of transcripts can yield the same principal component error. For example, 100,000 transcripts divided between either 50 or 400 cells both yield a principal component error of ~20%. This result is of particular relevance in single-cell experiments because transcript depth per cell is currently limited by a ~20% mRNA capture efficiency, and so cannot be easily increased (Shalek et al., 2014). In such cases, limited sequencing resources might be best used to sequence more cells at low depth rather than allocating sequencing resources to oversampling a few thousand unique transcripts.

Experimentalists can use the read depth calculator to predict requirements for read depth or sample number in high-throughput transcriptional profiling, given their desired accuracy, based on the statistics of principal value separation in our global survey. Figure 6B shows the reads required for desired accuracies and an assumed principal value for a human transcriptional experiment with 100 samples (typical values for the first five principal values for human are indicated in dashed lines). As an illustration, a hypothetical experiment with a typical first principal value of 1.4 × 10⁻⁵ (the median principal value from the 226 human microarray datasets) and 100 samples where 80% PCA accuracy is tolerable requires less than 5,000 reads per experiment, or less than 500,000 reads in total, occupying less than 0.125% of a single sequencing lane on the Illumina HiSeq 4000.

The predictions from this analytically derived read depth calculator are demonstrably accurate. We compared the analytically predicted number of reads required for 80% PCA accuracy in the first five transcriptional programs to the value determined through simulated shallow mRNA-seq for 226 microarray and 4 mRNA-seq human datasets. We determined k empirically by fitting 50% of the datasets. Cross-validation with the remaining 50% of the datasets showed remarkable agreement between the analytical predictions and the computationally determined values. In these calculations, the analytically predicted number of reads required to reach 80% accuracy deviates from the depth required in simulation by less than 10% (Figure 6C). The read depth calculator is available online (http://thomsonlab.github.io/html/formula.html).

Finally, while we use the first principal component for illustration, Equation 2 can be applied to any principal component, including the trailing principal components. Recent work discusses a statistical method to identify those principal components that are likely to be informative, and this work can be used in conjunction with Equation 2 to pinpoint the relevant principal components and the sequencing parameters needed to estimate them satisfactorily (Klein et al., 2015).

DISCUSSION

Single-cell transcriptional profiling is a technology that holds the promise of unlocking the inner workings of cells and uncovering the roots of their individuality (Klein et al., 2015; Macosko et al., 2015). We show that for many applications that rely on the determination of transcriptional programs, biological insights can be recapitulated at a fraction of the widely proposed high read depths. Our results are based on a rigorous mathematical framework that quantifies the tradeoff between read depth and accuracy of transcriptional program identification. Our analytical results pinpoint gene-gene covariance, a ubiquitous biological property, as the key feature that enables uncompromised performance of unsupervised gene expression analysis at low read depth. The same mathematical framework also leads to practical methods to determine the optimal read depth and sample number for the design of mRNA-seq experiments.

Given the principal values that we observe in the human microarray datasets, our analysis suggests that one can profile tens of thousands of samples, as opposed to dozens, while still being able to accurately identify transcriptional programs.



Figure 6. Mathematical Framework Provides a Read Depth Calculator and Guidelines for Shallow mRNA-Seq Experimental Design
(A) Error in the first principal component of the Zeisel et al. (2015) dataset for varying cell number and read depth. Black circles denote a fixed number of total transcripts (100,000). Error can be reduced by either increasing transcript coverage or the number of cells profiled.
(B) Number of reads required (color) to achieve a desired error (y axis) for a given principal value (x axis). Typical principal values (dashed black vertical lines) are the medians across the 352 gene expression datasets.
(C) Error of the read depth calculator (Equation 2) across 176 gene expression datasets used for validation (out of 352 total). The calculator predicts the number of reads needed to achieve 80% PCA accuracy in each dataset (colored dots). The predicted values closely agree with simulated results, with a median error <10% for the first five transcriptional programs.

At this scale, researchers can perform entire chemical or genetic knockout screens, or profile all ~1,000 cells in an entire Caenorhabditis elegans, 40 times over, in a single 400,000,000-read lane on the Illumina HiSeq 4000. Because shallow mRNA-based screens would provide information at the level of transcriptional programs and not individual genes, complementing these experiments by careful profiling of specific genes with targeted mRNA-seq (Fan et al., 2015) or of samples of interest with conventional deep sequencing would provide a more complete picture of the relevant biology.

Fundamentally, our results rely on a natural property of gene expression data: its effective low dimensionality. We observed that gene expression datasets often have principal values that span orders of magnitude independently of the measurement platform, and that this property is responsible for the noise tolerance of early principal components. These leading, noise-robust principal components are effectively a small number of dimensions that dominate the biological phenomena under investigation. These insights are consistent with previous observations that were made following the advent of microarray technology (Eisen et al., 1998; Segal et al., 2003; Bergmann et al., 2003), proposing that low dimensionality arises from extensive covariation in gene expression. We suggest that the covariances and principal values in gene expression are determined by the architectural properties of the underlying transcriptional networks, such as the co-regulation of genes, and therefore that it is the biological system itself that confers noise tolerance in shallow mRNA-seq measurements. Related work in neuroscience has explored the implications of hierarchical network architecture for learning the dominant dimensions of data (Saxe et al., 2013; Hinton and Salakhutdinov, 2006).



Discovering and exploiting low dimensionality to reduce uncertainty in measurements is at the heart of modern signal processing techniques (Donoho, 2006; Candes et al., 2006). These methods first found success in imaging applications, where low dimensionality arises from the statistics and redundancies of natural images, enabling most images to be accurately represented by a small number of wavelets or other basis functions. Our results suggest that shallow mRNA-seq is similarly enabled by an inherent low dimensionality in gene expression datasets that emerges from groups of covarying genes. Just as only a few wavelets are needed to represent most images, only a few groups of transcriptional programs seem to be necessary to produce a coarse-grained representation of transcriptional state. We believe that the measurement of many diverse biological systems could benefit from the identification and analysis of hidden low-dimensional representations. For instance, proteome quantification, protein-protein interactions, and human genetic variant data all contain high levels of correlations, suggesting these datasets may all be effectively low dimensional. We anticipate new modes of biological inquiry as advances from signal processing are integrated into biological data analysis and as the underlying structural features of biological networks are exploited for large-scale measurements.

EXPERIMENTAL PROCEDURES

Simulated Shallow Sequencing through Down-sampling of Reads
Transcriptional datasets were obtained from the GEO (Zeisel et al. [2015] was from http://www.linnarssonlab.org). mRNA-seq read counts were normalized by the total number of reads in the sample. For each read depth, we model the sequencing noise with a multinomial distribution. The Zeisel et al. (2015) data were sampled without replacement because of the unique molecular identifiers (see Supplemental Experimental Procedures).
ysis. ICML Proceedings of the 21st International Conference on Machine
Finding Genes Significantly Associated with a Principal Component Learning (ACM), p. 29.
We first generated a null distribution of gene loadings from the principal com-
Donoho, D.L. (2006). Compressed sensing. IEEE Trans. Inf. Theory 52, 1289
ponents of a shuffled, transcript-count matrix. All p values were computed with
1306.
respect to this distribution; averages over 15 replicates are reported.
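A sketch of this test, under our reading of the procedure: the loadings of a shuffled matrix form the null, and a gene is called significant when its observed loading is extreme under that null. `X` is a hypothetical samples-by-genes transcript-count matrix; averaging over shuffling replicates, as in the text, is omitted for brevity.

```python
# Empirical significance of gene loadings against a shuffled-data null.
import numpy as np

rng = np.random.default_rng(0)

def pc_loadings(M, pc_index=0):
    """Loadings (gene weights) of one principal component via SVD."""
    Mc = M - M.mean(axis=0)
    _, _, Vt = np.linalg.svd(Mc, full_matrices=False)
    return Vt[pc_index]  # rows of Vt are principal axes in gene space

# Shuffle sample labels per gene to destroy covariance, then compare loadings.
X_shuf = np.column_stack([rng.permutation(X[:, g]) for g in range(X.shape[1])])
null, obs = pc_loadings(X_shuf), pc_loadings(X)
# Two-sided empirical p value for each gene's observed loading.
pvals = np.array([(np.abs(null) >= abs(v)).mean() for v in obs])
```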
Gene Set Enrichment Analysis
GSEA was performed with 1,370 gene lists from MSigDB (Subramanian et al., 2005). The loadings of each principal component were collected in a distribution, and loadings within 2 SDs of the mean of this distribution were considered for analysis. We applied a hypergeometric test with a significance p value cutoff of 10⁻⁴.
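A sketch of the hypergeometric enrichment test described above, with `loadings` (per-gene loadings for one principal component) and the boolean membership mask `in_gene_set` as hypothetical inputs:

```python
# Hypergeometric enrichment of a gene set among high-loading genes
# (significance cutoff p < 1e-4, per the procedure above).
import numpy as np
from scipy.stats import hypergeom

def gsea_pvalue(loadings, in_gene_set):
    z = (loadings - loadings.mean()) / loadings.std()
    hits = np.abs(z) > 2                  # genes driving the component
    M = loadings.size                     # population size: all genes
    n = int(in_gene_set.sum())            # genes belonging to the set
    N = int(hits.sum())                   # number of high-loading genes drawn
    k = int((hits & in_gene_set).sum())   # overlap between the two
    return hypergeom.sf(k - 1, M, n, N)   # P(overlap >= k) under the null
```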
SUPPLEMENTAL INFORMATION

Supplemental Information includes Supplemental Experimental Procedures and five figures and can be found with this article online at http://dx.doi.org/10.1016/j.cels.2016.04.001.

AUTHOR CONTRIBUTIONS

G.H., H.E.-S., and M.T. conceived the idea. G.H. wrote the simulations and analyzed data, with input from M.T. and H.E.-S. R.B. and M.T. performed theoretical analysis. R.B. wrote the mathematical proofs. The manuscript was written by G.H., R.B., H.E.-S., and M.T.

ACKNOWLEDGMENTS

The authors would like to thank Jason Kreisberg, Alex Fields, David Sivak, Patrick Cahan, Jonathan Weissman, Chun Ye, Michael Chevalier, Satwik Rajaram, and Steve Altschuler for careful reading of the manuscript; Eric Chow, John Haliburton, Sisi Chen, and Emeric Charles for their experimental insights; and Paul Rivaud for website design assistance. This work was supported by the UCSF Center for Systems and Synthetic Biology (NIGMS P50 GM081879). H.E.-S. acknowledges support from the Paul G. Allen Family Foundation. M.T. acknowledges support from the NIH Office of the Director, the National Cancer Institute, and the National Institute of Dental and Craniofacial Research (NIH DP5 OD012194).

Received: November 30, 2015
Revised: March 8, 2016
Accepted: April 4, 2016
Published: April 27, 2016

REFERENCES

Alon, U., Barkai, N., Notterman, D.A., Gish, K., Ybarra, S., Mack, D., and Levine, A.J. (1999). Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc. Natl. Acad. Sci. USA 96, 6745-6750.

Alter, O., Brown, P.O., and Botstein, D. (2000). Singular value decomposition for genome-wide expression data processing and modeling. Proc. Natl. Acad. Sci. USA 97, 10101-10106.

Bengio, Y., Delalleau, O., Le Roux, N., Paiement, J.-F., Vincent, P., and Ouimet, M. (2004). Learning eigenfunctions links spectral embedding and kernel PCA. Neural Comput. 16, 2197-2219.

Bergmann, S., Ihmels, J., and Barkai, N. (2003). Iterative signature algorithm for the analysis of large-scale gene expression data. Phys. Rev. E Stat. Nonlin. Soft Matter Phys. 67, 031902.

Bonneau, R. (2008). Learning biological networks: from modules to dynamics. Nat. Chem. Biol. 4, 658-664.

Candes, E.J., Romberg, J.K., and Tao, T. (2006). Stable signal recovery from incomplete and inaccurate measurements. Commun. Pure Appl. Math. 59, 1207-1223.

Ding, C., and He, X. (2004). K-means clustering via principal component analysis. ICML Proceedings of the 21st International Conference on Machine Learning (ACM), p. 29.

Donoho, D.L. (2006). Compressed sensing. IEEE Trans. Inf. Theory 52, 1289-1306.

Duarte, M.F., Davenport, M.A., Takbar, D., Laska, J.N., Sun, T., Kelly, K.F., and Baraniuk, R.G. (2008). Single-pixel imaging via compressive sampling. IEEE Signal Process. Mag. 25, 83-91.

Edgar, R., Domrachev, M., and Lash, A.E. (2002). Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Res. 30, 207-210.

Eisen, M.B., Spellman, P.T., Brown, P.O., and Botstein, D. (1998). Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci. USA 95, 14863-14868.

Fan, H.C., Fu, G.K., and Fodor, S.P.A. (2015). Expression profiling. Combinatorial labeling of single cells for gene expression cytometry. Science 347, 1258367.

Ham, J., Lee, D.D., Mika, S., and Scholkopf, B. (2004). A kernel view of the dimensionality reduction of manifolds. ICML Proceedings of the 21st International Conference on Machine Learning (ACM), p. 47.

Hinton, G.E., and Salakhutdinov, R.R. (2006). Reducing the dimensionality of data with neural networks. Science 313, 504-507.

Holter, N.S., Maritan, A., Cieplak, M., Fedoroff, N.V., and Banavar, J.R. (2001). Dynamic modeling of gene expression data. Proc. Natl. Acad. Sci. USA 98, 1693-1698.

Jaitin, D.A., Kenigsberg, E., Keren-Shaul, H., Elefant, N., Paul, F., Zaretsky, I., Mildner, A., Cohen, N., Jung, S., Tanay, A., and Amit, I. (2014). Massively parallel single-cell RNA-seq for marker-free decomposition of tissues into cell types. Science 343, 776-779.

Cell Systems 2, 239250, April 27, 2016 249


Klein, A.M., Mazutis, L., Akartuna, I., Tallapragada, N., Veres, A., Li, V., Shalek, A.K., Satija, R., Adiconis, X., Gertner, R.S., Gaublomme, J.T.,
Peshkin, L., Weitz, D.A., and Kirschner, M.W. (2015). Droplet barcoding for sin- Raychowdhury, R., Schwartz, S., Yosef, N., Malboeuf, C., Lu, D., et al.
gle-cell transcriptomics applied to embryonic stem cells. Cell 161, 11871201. (2013). Single-cell transcriptomics reveals bimodality in expression and
Kliebenstein, D.J. (2012). Exploring the shallow end; estimating information splicing in immune cells. Nature 498, 236240, advance online publication.
content in transcriptomics studies. Front. Plant Sci. 3, 213. Shalek, A.K., Satija, R., Shuga, J., Trombetta, J.J., Gennert, D., Lu, D., Chen,
Kumar, R.M., Cahan, P., Shalek, A.K., Satija, R., DaleyKeyser, A.J., Li, H., Zhang, P., Gertner, R.S., Gaublomme, J.T., Yosef, N., et al. (2014). Single-cell RNA-
Kumar, R.M., Cahan, P., Shalek, A.K., Satija, R., DaleyKeyser, A.J., Li, H., Zhang, J., Pardee, K., Gennert, D., Trombetta, J.J., et al. (2014). Deconstructing transcriptional heterogeneity in pluripotent stem cells. Nature 516, 56–61.
Macosko, E.Z., Basu, A., Satija, R., Nemesh, J., Shekhar, K., Goldman, M., Tirosh, I., Bialas, A.R., Kamitaki, N., Martersteck, E.M., et al. (2015). Highly parallel genome-wide expression profiling of individual cells using nanoliter droplets. Cell 161, 1202–1214.
Ng, A.Y., Jordan, M.I., and Weiss, Y. (2001). On spectral clustering: analysis and an algorithm. In Advances in Neural Information Processing Systems (MIT Press), pp. 849–856.
Patel, A.P., Tirosh, I., Trombetta, J.J., Shalek, A.K., Gillespie, S.M., Wakimoto, H., Cahill, D.P., Nahed, B.V., Curry, W.T., Martuza, R.L., et al. (2014). Single-cell RNA-seq highlights intratumoral heterogeneity in primary glioblastoma. Science 344, 1396–1401.
Pollen, A.A., Nowakowski, T.J., Shuga, J., Wang, X., Leyrat, A.A., Lui, J.H., Li, N., Szpankowski, L., Fowler, B., Chen, P., et al. (2014). Low-coverage single-cell mRNA sequencing reveals cellular heterogeneity and activated signaling pathways in developing cerebral cortex. Nat. Biotechnol. 32, 1053–1058.
Ringner, M. (2008). What is principal component analysis? Nat. Biotechnol. 26, 303–304.
Roweis, S.T., and Saul, L.K. (2000). Nonlinear dimensionality reduction by locally linear embedding. Science 290, 2323–2326.
Saxe, A.M., McClelland, J.L., and Ganguli, S. (2013). Learning hierarchical category structure in deep neural networks. Proceedings of the 35th Annual Meeting of the Cognitive Science Society, pp. 1271–1276.
Segal, E., Shapira, M., Regev, A., Pe'er, D., Botstein, D., Koller, D., and Friedman, N. (2003). Module networks: identifying regulatory modules and their condition-specific regulators from gene expression data. Nat. Genet. 34, 166–176.
Shai, R., Shi, T., Kremen, T.J., Horvath, S., Liau, L.M., Cloughesy, T.F., Mischel, P.S., and Nelson, S.F. (2003). Gene expression profiling identifies molecular subtypes of gliomas. Oncogene 22, 4918–4923.
Shalek, A.K., Satija, R., Shuga, J., Trombetta, J.J., Gennert, D., Lu, D., Chen, P., Gertner, R.S., et al. (2014). Single-cell RNA-seq reveals dynamic paracrine control of cellular variation. Nature 510, 363–369.
Shankar, R. (2012). Principles of Quantum Mechanics (Springer Science & Business Media).
Shen, Y., Yue, F., McCleary, D.F., Ye, Z., Edsall, L., Kuan, S., Wagner, U., Dixon, J., Lee, L., Lobanenkov, V.V., and Ren, B. (2012). A map of the cis-regulatory sequences in the mouse genome. Nature 488, 116–120.
Stewart, G.W., and Sun, J. (1990). Matrix Perturbation Theory (Academic Press).
Subramanian, A., Tamayo, P., Mootha, V.K., Mukherjee, S., Ebert, B.L., Gillette, M.A., Paulovich, A., Pomeroy, S.L., Golub, T.R., Lander, E.S., and Mesirov, J.P. (2005). Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl. Acad. Sci. USA 102, 15545–15550.
Trapnell, C. (2015). Defining cell types and states with single-cell genomics. Genome Res. 25, 1491–1498.
Treutlein, B., Brownfield, D.G., Wu, A.R., Neff, N.F., Mantalas, G.L., Espinoza, F.H., Desai, T.J., Krasnow, M.A., and Quake, S.R. (2014). Reconstructing lineage hierarchies of the distal lung epithelium using single-cell RNA-seq. Nature 509, 371–375.
Van der Maaten, L., and Hinton, G. (2008). Visualizing data using t-SNE. J. Mach. Learn. Res. 9, 2579–2605.
Velculescu, V.E., Madden, S.L., Zhang, L., Lash, A.E., Yu, J., Rago, C., Lal, A., Wang, C.J., Beaudry, G.A., Ciriello, K.M., et al. (1999). Analysis of human transcriptomes. Nat. Genet. 23, 387–388.
Zeisel, A., Muñoz-Manchado, A.B., Codeluppi, S., Lönnerberg, P., La Manno, G., Juréus, A., Marques, S., Munguba, H., He, L., Betsholtz, C., et al. (2015). Brain structure. Cell types in the mouse cortex and hippocampus revealed by single-cell RNA-seq. Science 347, 1138–1142.



Cell Systems, Volume 2

Supplemental Information

Low Dimensionality in Gene Expression Data


Enables the Accurate Extraction of Transcriptional
Programs from Shallow Sequencing

Graham Heimberg, Rajat Bhatnagar, Hana El-Samad, and Matt Thomson


SUPPLEMENTAL INFORMATION

Table of Contents

Supplemental Figures
    Figure S1
    Figure S2
    Figure S3
    Figure S4
    Figure S5
1 Supplemental Figure Legends
2 Supplemental Theory
    2.1 Shallow sequencing
    2.2 Gene expression modules
3 Supplemental Experimental Procedures
Supplemental References
[Figure S1: graphics not reproduced in this text extract. Panels A–H plot principal component error against reads per sample, gene set enrichment significance, negative and positive predictive values of GSEA, FPKM-based principal component error and principal value decay, and the classification of mouse tissues on principal components 2 and 3. See the Figure S1 legend below.]
[Figure S2: graphics not reproduced in this text extract. Panels A–J show variance explained and principal value separation for the Zeisel, Treutlein, Kumar, and Shalek et al. datasets, principal component error as a function of read depth, and low-depth projections of single cells onto the first two principal components; panels K and L show t-SNE and LLE embeddings of the Kumar et al. and Shalek et al. data at decreasing read depths. See the Figure S2 legend below.]
[Figure S3: graphics not reproduced in this text extract. Panels A–D show principal value separation as a function of module size, the annotated block-diagonal covariance matrix of the two-module gene expression model, principal value separation as within-cluster covariance increases, and principal component loading error versus absolute gene expression level (R² = 0.13). See the Figure S3 legend below.]
[Figure S4: graphics not reproduced in this text extract. The figure contrasts the modular (original) and non-modular (shuffled) covariance structure of the Shen et al. data: the modular data are noise tolerant (~100,000 reads for 90% accuracy on the first three transcriptional programs), whereas the shuffled data are noise sensitive (~2.3 million reads required). See the Figure S4 legend below.]
[Figure S5: graphics not reproduced in this text extract. Panels A–E show principal value separations for six mRNA-seq datasets (Shen, Treutlein, Shalek, Kumar, Chen, Pollen) against shuffled data, the relationship between median NMF error and the sum of the first three principal values in yeast, mouse, and human microarray data, principal value decay for original and shuffled human data, and the lack of correlation between sample number and the first principal value. See the Figure S5 legend below.]
1 Supplemental Figure Legends
Figure S1: Stability of principal component error and gene set enrichment analy-
sis across down-sampling replicates for the Shen et al. dataset, related to Figure 2
(A) Mean (solid lines) and standard deviation (error bars) in principal component error of
mouse tissue data (Shen et al.), as a function of read depth as calculated from 20 simulated
shallow sequencing experiments at each of 25 indicated read depths. Narrow width of error
bars illustrates the stability of the PCA error calculation to the downsampling procedure.
The mean PCA error curves are also shown in Figure 2A.
(B) Principal component error for the first 38 principal components of the mouse tissue data
at 7 read depths, illustrating the number of principal components that can be accurately
reconstructed as sample read depth is decreased. For example, nine principal components
can be reconstructed at less than 20% error with only 133,000 reads per sample.
(C) Gene Set Enrichment Analysis for principal component 1 of the mouse tissue dataset
at decreasing read depth. Significant gene sets (see scale bar) are stable even below 32,000
reads. Figure 2 focuses on analysis of principal components 2 and 3 as they are of more
biological relevance for classification.
(D) Negative Predictive Value of Gene Set Enrichment Analysis applied to mouse tissue
data for the first three principal components (color) over a large range of read depths. Neg-
ative Predictive Value indicates the fraction of gene sets correctly considered insignificant
out of all gene sets considered insignificant.
(E) Positive Predictive Value of Gene Set Enrichment Analysis applied to mouse tissue data
for the first three principal components (color) over a large range of read depths. Positive
Predictive Value indicates the fraction of gene sets correctly considered statistically signifi-
cant out of all gene sets considered statistically significant.
(F) Principal component error as a function of read depth for selected principal components
for the Shen et al. data as in Figure 2A. Here the transcriptional programs are calculated
from the FPKM values rather than read count data. Again, the first three principal components can be recovered with >80% accuracy with just 1% of the traditional read depths.
Improvements in error exhibit diminishing returns as read depth is increased. Less dominant transcriptional programs (principal components 8 and 15 shown) are more sensitive to
sequencing noise.
(G) Variance explained by transcriptional program (blue) and differences between principal
values (green) calculated from the FPKM values. Like the read count data, the leading,
dominant transcriptional programs have principal values that are well-separated from later
principal values suggesting that these should be more robust to sequencing noise.
(H) Projection of a subset of the mouse tissue data onto principal components two and
three as in Figure 2. Here, principal components are calculated with FPKM values, rather
than read count data. As in Figure 2D, the ellipses represent uncertainty due to sequenc-
ing noise at specific read depths. Again, similar tissues lie close together. Transcriptional
program two separates neural tissues from non-neural tissues while transcriptional program
three distinguishes tissues involved in haematopoiesis from other tissues.

Figure S2: Principal value separation is large in single cell mRNA-seq datasets,
related to Figure 3
(A) Variance explained by principal components (blue) and differences between principal
values (green) of the Zeisel et al. data. Similar to the bulk mRNA-seq data, the leading,
dominant transcriptional programs have principal values that are well-separated from later
principal values suggesting that these should be more robust to measurement noise. See
Figure 3 for the principal component error and cell-type classification accuracy as a function
of transcript coverage.
(B) Variance explained by principal components (blue) and differences between principal
values (green) of the Treutlein et al. data.
(C) Principal component error as a function of read depth for the first three principal com-
ponents for the Treutlein et al. data.
(D) Transcriptional state of single cells during lung development at two time points, E16.5
and E18.5, from Treutlein et al., projected onto the first two principal components (see Supplemental Experimental Procedures). Radii indicate error at a given read depth. Developmental
stages corresponding to nascent (16.5) and mature (18.5) progenitor cells can be distin-
guished at 3,200 reads.
(E) Variance explained by principal components (blue) and differences between principal
values (green) of the Kumar et al. data.
(F) Principal component error as a function of read depth for the first three principal com-
ponents for the Kumar et al. data.
(G) Projection of the transcriptional state of wild type and Dgcr8 knockout mouse embry-
onic stem cells from Kumar et al. on the first two principal components. The wild type cells
separate from the knockout cells, which are deficient in miRNA processing, at 3,200 reads.
(H) Variance explained by principal components (blue) and differences between principal
values (green) of the Shalek et al. data.
(I) Principal component error as a function of read depth for the first three principal com-
ponents for the Shalek et al. data.
(J) Projection of the transcriptional state of 18 bone-marrow-derived dendritic single cells
from Shalek et al. data on the first two principal components. The "mature" and "not mature" cells are distinguishable at 3,200 reads.
(K) and (L) Distinct transcriptional states in single cells can be uncovered by the nonlinear
unsupervised learning methods t-SNE and LLE at low read depth. Computed clusters in
Kumar et al. data and Shalek et al. data at $10^5$ reads are almost identical to those obtained
at $10^7$ reads, with significant information preserved even at $10^3$ reads.

Figure S3: Impact of gene expression covariance and absolute gene expression
level on principal value separation, related to Figure 4
(A) Principal value separation increases with module size. Principal values $\lambda_i$ shown for
a sixteen gene system for increasing module size $b$ (with $q = 40$, $r = -8$, and $m_i$ constant
within blocks and spanning [80, 200]).
(B) Block-diagonal covariance matrix for a general model of gene expression analyzed in
the Supplemental Information Section 2.2 (the ten gene, two module case is illustrated).
The matrix is annotated with the relevant parameters $q$, $r$, $m_i$, $b$. The first principal component
discriminates membership in the two underlying gene expression modules.
(C) Principal value separation increases or remains constant as within-cluster covariance $q$
increases. Clustered covariance matrices are depicted along the x-axis, and the first four principal
values are analytically determined from the gene expression model of (B) as described in
Supplemental Information Section 2.2, with $b = 4$, $r = -1$, $m_1 = 10$, $m_2 = 6$.
(D) Principal component loading error versus absolute gene expression level for the Shen
et al. dataset. For each gene, the absolute gene expression level is normalized read counts
summed across samples. The gene-wise loading error for gene $i$ is calculated as
$|pc_{1,i} - \widehat{pc}_{1,i}|/pc_{1,i}$ at a read depth of 46,000 reads per sample. The weak correlation ($R^2 = 0.13$)
indicates that absolute gene expression level does not significantly contribute to the gene-wise
principal component error.

Figure S4: The modularity of gene expression enables accurate, low depth tran-
scriptional program identification in single cell mRNA-seq data, related to Fig-
ure 4
The gene expression covariance matrix of the Shen et al. data reveals large modules of co-
varying genes (middle), whose signature is a few, dominant transcriptional programs that
explain relatively large variances in the data (top). As predicted by the model, these dom-
inant transcriptional programs are robust to low-coverage profiling (bottom). Shuffling the
Shen et al. data destroys the modular structure, resulting in noise-sensitive transcriptional
programs. For the shuffled data, 2.3 million reads are required for 90% accuracy of the first
three transcriptional programs, whereas 100,000 reads suffice for the original dataset.

Figure S5: Principal value separation is common in mRNA-seq and microarray


datasets and is due to gene expression covariance, related to Figure 5
(A) Principal value separations $\lambda_i - \lambda_{i+1}$ for six mRNA-seq datasets, illustrating the generality of principal value decay. The median of the differences between principal values for the
datasets after shuffling is shown in black.
(B) Relationship between error in Non-negative Matrix Factorization (NMF) and sum of
first three principal values of microarray data. NMF error calculated as median error across
three NMF parts at 45,000 reads per sample (see Supplemental Experimental Procedures).
As the summed principal values on the x-axis represent the variance explained by the first
three principal components, this indicates that the performance of NMF correlates with the
existence of dominant, transcriptional programs.
(C) Ratio of principal values, $\lambda_i/\lambda_1$, in human microarray data for $2 \le i \le 10$. Principal
values are well-separated.
(D) After shuffling datasets to remove gene expression covariance, the principal values are
no longer well-separated.
(E) Sample number (x-axis) and the magnitude of the first principal value $\lambda_1$ (y-axis) are
approximately uncorrelated. Left: 20 yeast datasets, $R^2 = 0.04$. Middle: 106 mouse datasets,
$R^2 = 0.08$. Right: 226 human datasets, $R^2 = 7.0 \times 10^{-4}$.

2 Supplemental Theory
2.1 Shallow sequencing
This section develops the theoretical framework used in the main text to analyze shallow
sequencing. We use perturbation theory to find how principal components of shallow data
differ from those of deep data and explore this relationship in the context of a simple,
multinomial noise model for mRNA-sequencing. In the process, we provide background on
Equation (1) and derive Equation (2) of the main text.

We begin by summarizing and extending the notation of the main text.

2.1.1 Introduction and Notation


Suppose we have collected reads from deep mRNA-seq experiments in a matrix $G$ of dimensions $g \times n$, where $g$ is the number of genes in the genome and $n$ is the number of experimental samples analyzed. Each entry satisfies $0 \le G_{ij} \le N_{deep}$, where $N_{deep}$ is the total number of reads collected in each sample and is assumed (for convenience) to be constant across samples. Then $P^0_{ij}$, the probability that a transcript from sample $j$ maps to gene $i$, is equal to $G_{ij}/N_{deep}$. We assume that $N_{deep}$ is large enough that $P^0_{ij}$ represents the true, underlying probabilities of gene expression. It is frequently more convenient to work with the row-centered probabilities, $P_{ij} \triangleq P^0_{ij} - \frac{1}{n}\sum_j P^0_{ij}$. For instance, the deep gene covariance matrix $C$, of dimensions $g \times g$, is of fundamental interest and can be written simply as $C \triangleq PP^T/(n-1)$, where $P$ is a matrix with entries $P_{ij}$. Note that while we define the covariance matrix in terms of transcript probabilities, we could similarly define the covariance matrix in terms of transcripts measured in FPKM units, by rescaling each $P_{ij}$ by a gene-length-dependent factor. We choose to work with transcript probabilities for mathematical convenience.

Now assume that we repeat the sequencing experiments with only $N \ll N_{deep}$ reads. From these shallow mRNA-seq experiments, we obtain data $\hat G_{ij}$ from which we compute the gene expression probabilities $\hat P_{ij}$ and gene-gene covariances $\hat C_{ij}$, which we collect in matrices $\hat P$ and $\hat C$. (In general, we put hats on the quantities calculated from shallow data.) We are primarily interested in minimizing the number of reads while preserving the biologically relevant information contained in $C$. In particular, this section addresses the question of how sequencing depth $N$ affects the distance between the $i$th principal component of $P$ and the $i$th principal component of $\hat P$.

The principal components $v_i$ and principal values $\lambda_i$ of the probability matrix $P$ are the eigenvectors and eigenvalues of the covariance matrix $C$ and therefore satisfy

$$C v_i = \lambda_i v_i. \tag{2.1a}$$

We adopt the convention that the eigenvectors are sorted by decreasing eigenvalue, so $v_1$ is the first eigenvector of $C$, corresponding to the direction of maximum variance in the data $P$. Similarly, the shallow principal components and shallow principal values satisfy

$$\hat C \hat v_i = \hat\lambda_i \hat v_i. \tag{2.1b}$$

The rest of this section is structured as follows. Section 2.1.2 describes a simple, multinomial noise model for mRNA-seq and describes how noise propagates to the shallow covariance matrix. Section 2.1.3 explains how this noise perturbs the deep covariance matrix and bounds the resulting change in the first principal component, thereby deriving Equation (2) of the main text. Finally, Section 2.1.4 generalizes this result to higher principal components.

We additionally use the following notation. A matrix $X$ has elements $X_{ij}$. The transpose of a vector $x$ is $x^T$. Expectation is denoted $E$ and variance by $V$. We use $\|\cdot\|$ or $\|\cdot\|_2$ to mean the $\ell_2$ norm of a vector and the spectral norm of a matrix (i.e., the maximum singular value of the matrix). We write $\|\cdot\|_1$ for the $\ell_1$ norm and $\|\cdot\|_\infty$ for the infinity norm of a matrix (i.e., the maximum absolute row sum).
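As a concrete companion to this notation, the sketch below (a minimal illustration, not the production pipeline; it assumes numpy and a genes-by-samples read count matrix G) computes the row-centered probability matrix $P$, the gene covariance $C$, and the leading principal components:

```python
import numpy as np

def principal_components(G, k=3):
    """Minimal sketch of the notation above: G is a genes x samples
    read count matrix; returns the first k principal values and
    principal components of the row-centered probability matrix P."""
    G = np.asarray(G, dtype=float)
    P0 = G / G.sum(axis=0)                    # P0[i, j]: probability a read in sample j maps to gene i
    P = P0 - P0.mean(axis=1, keepdims=True)   # row-center across samples
    n = P.shape[1]
    C = (P @ P.T) / (n - 1)                   # g x g gene covariance matrix
    lam, V = np.linalg.eigh(C)                # eigenvalues in ascending order
    order = np.argsort(lam)[::-1]             # sort by decreasing principal value
    return lam[order][:k], V[:, order[:k]]
```

For genome-scale $g$, forming the $g \times g$ matrix $C$ explicitly is wasteful; a singular value decomposition of $P$ yields the same principal components, since $C = PP^T/(n-1)$.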

2.1.2 Noise Model


We first introduce a noise model that describes the impact of sequencing depth on the data $\hat G$. While there are many sources of noise in the measurement of mRNA transcripts, we begin with a simplifying assumption.

Assumption 1. The dominant source of noise in the measurement of mRNA transcripts is counting noise. Further, all noise, across both genes and samples, is uncorrelated.

Hence the data collected from shallow sequencing is

$$\hat G_{ij} \sim \mathrm{Binomial}(N, P^0_{ij}).$$

The maximum likelihood estimate of the true (i.e., obtained from deep data) gene expression probabilities is simply

$$\hat P^0 = \hat G/N.$$

Using the assumption that all noise (across both genes and samples) is uncorrelated, and further assuming that the binomial is well approximated by the normal distribution, we have that the underlying probabilities are

$$\hat P^0_{ij} \sim \mathrm{Normal}\!\left(P^0_{ij},\; \frac{1}{N}P^0_{ij}(1 - P^0_{ij})\right)$$

which, when row-centered, are

$$\hat P_{ij} \sim \mathrm{Normal}\!\left(P^0_{ij} - \frac{1}{n}\sum_j P^0_{ij},\; \frac{1}{N}P^0_{ij}(1 - P^0_{ij})\right). \tag{2.2}$$

With this notation, the shallow gene covariance is $\hat C \triangleq \hat P \hat P^T/(n-1)$.

Similar models for sequencing noise, as well as more specialized models for single-cell RNA-seq, have been recently proposed (Marioni et al. 2008; McIntyre et al. 2011; Pollen et al. 2014; Liu, Zhou, and White 2014; Tarazona et al. 2011; Anders and Huber 2010; Islam et al. 2014; Shiroguchi et al. 2012; Grün, Kester, and van Oudenaarden 2014; Brennecke et al. 2013; Ding et al. 2015; Daley and Smith 2014; Vallejos, Marioni, and Richardson 2015). Our goal in what follows is to use the simplest noise model to identify the important parameters and capture their basic dependencies. However, more realistic noise models will fit comfortably in our framework.
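The counting-noise model of Assumption 1 is easy to probe by simulation. The sketch below (with illustrative parameter values only) checks that the sampling variance of $\hat P^0_{ij}$ matches the normal approximation of equation (2.2):

```python
import numpy as np

rng = np.random.default_rng(0)

p0 = 1e-4        # true expression probability of one gene (illustrative)
N = 100_000      # shallow read depth
trials = 20_000  # repeated simulated shallow experiments

G_hat = rng.binomial(N, p0, size=trials)  # counting noise: Binomial(N, p0)
P0_hat = G_hat / N                        # maximum likelihood estimate

print(P0_hat.var())       # empirical variance across experiments
print(p0 * (1 - p0) / N)  # predicted variance from equation (2.2)
```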

To measure the error induced by shallow sequencing, we introduce two definitions.

Definition 1. The error in gene expression probabilities is $E \triangleq \hat P - P$.

From equation (2.2), this error is distributed as

$$E_{ij} \sim \mathrm{Normal}\!\left(0,\; \frac{1}{N}P^0_{ij}(1 - P^0_{ij})\right). \tag{2.3}$$

Definition 2. The covariance distortion induced by shallow sequencing is $D \triangleq \hat C - C$.

With the definitions of gene covariance as well as Definition 1, the covariance distortion can be expanded as

$$D = \frac{1}{n-1}\left(PE^T + EP^T + EE^T\right). \tag{2.4}$$

2.1.3 Perturbation theory

Our goal is to find how the principal components of $\hat P$ differ from those of $P$. Our approach treats the covariance distortion $D$ as a (random) perturbation to the deep covariance matrix $C$. We then use a result from perturbation theory as well as the properties of the noise model of Section 2.1.2 to find the resulting change in the principal components of $P$. Along the way, we introduce assumptions that reflect properties of biological data that are needed to simplify the result.

Our main tool from perturbation theory (Stewart and Sun 1990; Shankar 2012) describes how the eigenvectors of a positive semi-definite matrix change when the matrix is perturbed:

Proposition 1. Let $C$ be a positive semi-definite matrix with eigenvalues $\lambda_k^0$ and eigenvectors $v_k^0$. Further let

$$C(\epsilon) = C + \epsilon D$$

be a perturbation of $C$. With some weak assumptions on $D$, the eigenvalues and eigenvectors of $C(\epsilon)$ are

$$\lambda_k(\epsilon) = \lambda_k^0 + \epsilon\lambda_k^1$$
$$v_k(\epsilon) = v_k^0 + \epsilon v_k^1$$

where the first-order corrections to the eigenvalues and eigenvectors are

$$\lambda_k^1 = v_k^{0T} D\, v_k^0$$

and

$$v_k^1 = \sum_{j \ne k} \frac{v_j^{0T} D\, v_k^0}{\lambda_k^0 - \lambda_j^0}\, v_j^0 + a_k v_k^0.$$

Here $a_k$ is a constant that is determined by the constraint that $v_k$ is unit length.

Equation (1) of the main text immediately follows from this proposition. As explained in the main results, a natural measure of the error induced by shallow sequencing in the $k$th principal component is $\|\hat v_k - v_k\|_2$. From Proposition 1, this quantity is, to first order,

$$\|\hat v_k - v_k\|_2 \approx \sqrt{a_k^2 + \sum_{j \ne k}\left(\frac{v_j^T D\, v_k}{\lambda_k - \lambda_j}\right)^2}. \qquad \text{(main text equation 1)}$$

In this formula, $a_k$ can be determined by the convention that $\hat v_k$ has unit length,

$$(1 + a_k)^2 = 1 - \sum_{j \ne k}\left(\frac{v_j^T D\, v_k}{\lambda_k - \lambda_j}\right)^2.$$
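Proposition 1 can be verified numerically on a small example. The following sketch (assuming numpy; the matrices are arbitrary test inputs, not biological data) compares the exact leading eigenvector of a perturbed covariance matrix with its first-order prediction:

```python
import numpy as np

rng = np.random.default_rng(1)

A = rng.normal(size=(6, 6))
C = A @ A.T                     # positive semi-definite "deep" covariance
B = rng.normal(size=(6, 6))
D = 0.01 * (B + B.T)            # small symmetric perturbation

lam, V = np.linalg.eigh(C)
lam, V = lam[::-1], V[:, ::-1]  # sort eigenpairs by decreasing eigenvalue

_, V_hat = np.linalg.eigh(C + D)
v1_exact = V_hat[:, -1]                  # leading eigenvector of C + D
v1_exact = v1_exact * np.sign(v1_exact @ V[:, 0])   # resolve sign ambiguity

# first-order correction from Proposition 1 (a_k contributes at second order)
v1_approx = V[:, 0] + sum(
    (V[:, j] @ D @ V[:, 0]) / (lam[0] - lam[j]) * V[:, j]
    for j in range(1, 6))

print(np.linalg.norm(v1_exact - v1_approx))  # small residual, O(||D||^2)
```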

We now focus on deriving an upper bound for the expected error of the first principal component, $E\|\hat v_1 - v_1\|_2$, where the expectation is over noise. As a vector's $\ell_2$ norm is always less than its $\ell_1$ norm, we can bound the expectation of $\|\hat v_1 - v_1\|_1$ instead. By Proposition 1, we have, to first order,

$$E\|\hat v_1 - v_1\|_2 \le E\|\hat v_1 - v_1\|_1 = E\sum_{j \ne 1}\left|\frac{v_j^T D\, v_1}{\lambda_1 - \lambda_j}\right| \le \left\{\sum_{j \ne 1}\frac{1}{(\lambda_1 - \lambda_j)^2}\right\}^{1/2} E\left\{\sum_{j \ne 1}(v_j^T D\, v_1)^2\right\}^{1/2} \tag{2.6}$$

where we have used the Cauchy-Schwarz Inequality to isolate the effects of the numerator and denominator. As the Pythagorean Theorem states that $\sum_j (v_j^T D\, v_1)^2 = \|D v_1\|^2$, we have, using the definition of matrix norm, that

$$E\|\hat v_1 - v_1\|_2 \le \left\{\sum_{j \ne 1}\frac{1}{(\lambda_1 - \lambda_j)^2}\right\}^{1/2} E\|D\|. \tag{2.7}$$

The norm of the covariance distortion $\|D\|$ can be expanded from equation (2.4) as

$$\begin{aligned}
\|D\| &= \|\hat C - C\| \\
&= (n-1)^{-1}\,\|(P+E)(P+E)^T - PP^T\| \\
&= (n-1)^{-1}\,\|PE^T + EP^T + EE^T\| \\
&\le (n-1)^{-1}\left(2\|PE^T\| + \|EE^T\|\right) \\
&\le \frac{2}{n-1}\,\|P\|\,\|E\| + O(\|E\|^2), \tag{2.8}
\end{aligned}$$

where the last inequality follows from the sub-multiplicativity of the matrix norm. Hence $\|D\|$ is bounded by the product of $\|P\|$ and $\|E\|$ plus higher order error terms. Putting this result in equation (2.7), we have

Proposition 2. With the notation established,

$$E\|\hat v_1 - v_1\|_2 \le \frac{2}{n-1}\,\|P\|\; E\|E\| \left\{\sum_{j \ne 1}\frac{1}{(\lambda_1 - \lambda_j)^2}\right\}^{1/2} + O(\|E\|^2). \tag{2.9}$$
So far our analysis has been general, aside from dropping higher order terms in the perturbation expansion. We next analyze in turn each of the three terms on the right side of equation (2.9), $\|P\|$, $\|E\|$, and $\{\sum_{j \ne 1}(\lambda_1 - \lambda_j)^{-2}\}^{1/2}$, and introduce assumptions where necessary to simplify. In particular, while $\|E\|$ can be simplified in different ways depending on the assumptions made regarding noise, we will assume the noise model of the previous section to analyze this term.

The norm of $P$ is easy to compute from the definition of the gene covariance matrix. As $C$ is defined to equal $PP^T/(n-1)$, we have that $\|P\|^2 = (n-1)\|C\|$, from which follows

$$\|P\|^2 = (n-1)\lambda_1, \tag{2.10}$$

using the fact that $C$ is positive semi-definite.

Next we turn to the norm of $E$, which fundamentally represents the strength of the noise, or the "noise power," caused by sequencing at a shallow depth. Evaluating this quantity is more challenging, as from our noise model analysis of Section 2.1.2, each entry of $E$ is a gaussian random variable with a different variance. Such matrices are studied in random matrix theory. For instance, Corollary 4.2 of Tropp 2011 provides a tail bound for the probability that the norm of this random matrix exceeds a fixed quantity. In our notation, the tail bound states that

$$\Pr\{\|E\| > t\} < \min\{(n+g)\exp(-t^2/2\sigma^2),\, 1\},$$

where $\sigma^2$ is a variance parameter equal to

$$\sigma^2 = \max\left\{\max_j \frac{1}{N}\sum_k P^0_{jk}(1 - P^0_{jk}),\; \max_k \frac{1}{N}\sum_j P^0_{jk}(1 - P^0_{jk})\right\}. \tag{2.11}$$

This tail bound is sufficient to bound the first two moments of $\|E\|$ as shown in Section 4.3 of Tropp 2011. The second moment of $\|E\|$ follows from the fact that the expectation of a non-negative random variable is the integral of one minus its cdf, so

$$\begin{aligned}
E\|E\|^2 &= \int_0^\infty \Pr\{\|E\|^2 > t\}\, dt \\
&\le \int_0^\infty \min\{(n+g)\exp(-t/2\sigma^2),\, 1\}\, dt \\
&= 2\sigma^2\log(n+g) + \int_{2\sigma^2\log(n+g)}^\infty (n+g)\exp(-t/2\sigma^2)\, dt \\
&= 2\sigma^2\log\bigl(e(n+g)\bigr).
\end{aligned}$$

This directly leads to a bound for the expectation of $\|E\|$. Since $V\|E\| = E(\|E\|^2) - (E\|E\|)^2 \ge 0$, we have that $E\|E\| \le (E\|E\|^2)^{1/2}$, from which

$$E\|E\| \le \sigma\,\{2\log\bigl(e(n+g)\bigr)\}^{1/2}. \tag{2.12}$$

The term in the square root depends on $n$ and $g$, but very weakly. For instance, taking the small values of $g = 1000$ and $n = 10$, the quantity $\{2\log(e(n+g))\}^{1/2}$ is 3.97. On the other hand, for $g = n = 10^5$, the term increases only to 5.14. Hence for values within an order of magnitude of what we may encounter in practice, we incur little error by treating this term as constant.

We now simplify the variance parameter of equation (2.11). The variance parameter is the maximum of the largest row sum and the largest column sum of the variances in $E$. As typically there are many more genes than samples, the largest column sum is greater than the largest row sum and therefore determines $\sigma^2$. More formally we have

Assumption 2. As commonly $P_{ij} \ll 1$, assume that $P_{ij} \gg P_{ij}^2$. Additionally suppose that $\|P\|_\infty < 1$. This latter assumption will be satisfied if $n$ is small, say $n < 1/\sqrt{\lambda_1}$, as $\|P\|_\infty \le \sqrt{n}\,\|P\|_2 \le n\sqrt{\lambda_1}$.

Then the variance parameter $\sigma^2$ reduces to

$$\sigma^2 \approx \max\left\{\max_j \frac{1}{N}\sum_k P^0_{jk},\; \max_k \frac{1}{N}\sum_j P^0_{jk}\right\} = \max_k \frac{1}{N}\sum_j P^0_{jk} = \frac{1}{N}. \tag{2.13}$$

Combining these results, we have shown

Proposition 3. With Assumptions 1 and 2, the expectation of the norm of the error in the gene expression probabilities $E$ satisfies

$$E\|E\| \le \frac{\alpha}{\sqrt{N}} \tag{2.14}$$

where $\alpha$ is a constant ($\{2\log(e(n+g))\}^{1/2}$) effectively independent of the dimensions of the matrix.
Finally, we analyze the third term on the right side of equation (2.9). Splitting the term into two sums yields

$$\left\{\sum_{j \ne 1}\frac{1}{(\lambda_1 - \lambda_j)^2}\right\}^{1/2} = \left\{\sum_{\substack{j \le n \\ j \ne 1}}\frac{1}{(\lambda_1 - \lambda_j)^2} + \sum_{j > n}\frac{1}{\lambda_1^2}\right\}^{1/2} \le \left\{\frac{n-1}{(\lambda_1 - \lambda_2)^2} + \frac{g-n}{\lambda_1^2}\right\}^{1/2}, \tag{2.15}$$

using the fact that $\lambda_1 - \lambda_2 \le \lambda_1 - \lambda_j$ for all $j > 1$ (and that the covariance matrix has rank at most $n-1$, so $\lambda_j = 0$ for $j > n$). As discussed in the main text, typically the principal values decay rapidly so that the $k$th principal value spacing is large with respect to $\lambda_k$. This observation allows us to compare the relative magnitude of the two terms in equation (2.15) and motivates

Assumption 3. Let $\delta_k \triangleq \min_j\{|\lambda_k - \lambda_j|\}$ be the minimum distance between the $k$th principal value and any other principal value. Assume that $P$ satisfies

$$\sqrt{\frac{n}{g-n}} \ll \frac{\delta_k}{\lambda_k} \quad \text{for all } k. \tag{2.16}$$

In general, we call matrices that satisfy this property well-separated.

In many cases of interest, $\delta_k = \lambda_k - \lambda_{k+1}$. Then equation (2.16) reduces to a simpler expression,

$$\frac{\lambda_{k+1}}{\lambda_k} \ll 1 - \sqrt{\frac{n}{g-n}} \quad \text{for all } k.$$
Intuitively, this property states that each principal value is much smaller than the one that preceded it. This property can be checked in actual data in Figures S5C and S5D. With this assumption, we neglect the first term in the sum of equation (2.15) to obtain

$$\left\{\sum_{j \ne 1}\frac{1}{(\lambda_j - \lambda_1)^2}\right\}^{1/2} \approx \left\{\frac{g-n}{\lambda_1^2}\right\}^{1/2}. \tag{2.17}$$

We return to equation (2.9) and put all of these results together. If we apply the bounds of (2.10), (2.14), and (2.17) and drop the higher order error terms, we find

$$E\|\hat v_1 - v_1\|_2 \le 2\alpha\left\{\frac{g-n}{\lambda_1 N (n-1)}\right\}^{1/2}.$$

Since Assumption 3 implies that $g \gg n$ and in practice $n \gg 1$, this inequality is approximately

$$E\|\hat v_1 - v_1\|_2 \le \beta\,\sqrt{\frac{1}{\lambda_1 N n}} \tag{2.18}$$

where we have absorbed constants into $\beta$. This completes our derivation of Equation (2).
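In practice, equation (2.18) can be inverted to estimate the read depth required for a target principal component accuracy. A minimal sketch (assuming numpy; the default $\beta$ is the value fitted for human microarray data in the Supplemental Experimental Procedures, and the error is expressed in the same units used when fitting $\beta$):

```python
import numpy as np

def expected_pc_error(lam_k, n, N, beta=71.25):
    """Predicted error of principal component k from Equation (2)."""
    return beta / np.sqrt(lam_k * n * N)

def reads_needed(lam_k, n, target_error, beta=71.25):
    """Equation (2) solved for N: reads per sample required to recover
    principal component k with expected error below target_error."""
    return beta ** 2 / (lam_k * n * target_error ** 2)
```

Because the bound scales as $1/\sqrt{N}$, halving the acceptable error requires quadrupling the read depth, while adding samples ($n$) trades off against depth one-for-one.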

Remark. It is natural to ask what happens when $n$ is large, contrary to Assumption 2. In this case, the simplification used for the variance parameter in equation (2.13) no longer holds, and equation (2.14) is replaced with

$$E\|E\| \le \alpha\sqrt{\frac{P_{max}\, n}{N}}$$

where $P_{max} = \max_{ij}\{P^0_{ij}\}$. Consequently, the best bound our methods show for the error of the first principal component is

$$E\|\hat v_1 - v_1\|_2 \le \beta\,\sqrt{\frac{1}{\lambda_1 N}},$$

in place of equation (2.18).

2.1.4 Higher principal components


Similar reasoning shows that bound (2.18) can be adapted to apply to higher principal components. For principal component $k$, the previous analysis holds but equation (2.15) must be generalized to

$$\left\{\sum_{j \ne k}\frac{1}{(\lambda_j - \lambda_k)^2}\right\}^{1/2} \le \left\{\frac{n-1}{\min_j(\lambda_k - \lambda_j)^2} + \frac{g-n}{\lambda_k^2}\right\}^{1/2}.$$

With this modification, we find, using the same reasoning based on Assumption 3, that

$$E\|\hat v_k - v_k\|_2 \le \beta\,\frac{\sqrt{\lambda_1}}{\lambda_k}\sqrt{\frac{1}{n N}}. \tag{2.19}$$
This is a conservative bound that will likely be sufficient for many applications. However,
the bound can be improved as we show next.

The key idea to improving the bound is that $D$ can be written as a sum of rank-one projections, many of which contribute only second order terms to the perturbation expansion of Proposition 1. We will remove these second order terms to find a reduced perturbation that we call $D'$ (which has strictly smaller norm than $D$) and use this perturbation to bound $E\|\hat v_k - v_k\|$ through Proposition 1. As an illustration, consider the second principal component, $v_2$, and its noisy counterpart, $\hat v_2$. To first order,

$$\hat v_2 - v_2 = \sum_{j \ne 2}\frac{v_j^T D\, v_2}{\lambda_2 - \lambda_j}\, v_j.$$

We expand the terms in $D = \hat C - C$ via eigendecomposition to find

$$\hat v_2 - v_2 = \sum_{j \ne 2}\frac{v_j^T(\hat C - C)v_2}{\lambda_2 - \lambda_j}\, v_j = \sum_{j \ne 2}\left\{\sum_{i=1}^n \hat\lambda_i\,\frac{v_j^T \hat v_i\, \hat v_i^T v_2}{\lambda_2 - \lambda_j} - \sum_{i=1}^n \lambda_i\,\frac{v_j^T v_i\, v_i^T v_2}{\lambda_2 - \lambda_j}\right\} v_j.$$

In the inner sum, we incur an error of $O(\lambda_1\|D\|^2)$ if we choose to skip the $i = 1$ term. This is because both $v_j^T \hat v_1$ and $\hat v_1^T v_2$ are on the order of $\|D\|$. For a sufficiently large number of reads, this is smaller than the $i = 2$ term, which is $O(\lambda_2\|D\|)$, as $v_j^T \hat v_2$ is $O(\|D\|)$ and $\hat v_2^T v_2$ is $O(1 - \|D\|^2)$. Discarding these second order terms, we have by this analysis

$$\hat v_2 - v_2 \approx \sum_{j \ne 2}\frac{v_j^T(\hat C' - C')v_2}{\lambda_2 - \lambda_j}\, v_j = \sum_{j \ne 2}\frac{v_j^T D'\, v_2}{\lambda_2 - \lambda_j}\, v_j. \tag{2.20}$$

In this equation, $C'$ and $\hat C'$ are reduced matrices equal to $\sum_{i>1}\lambda_i v_i v_i^T$ and $\sum_{i>1}\hat\lambda_i \hat v_i \hat v_i^T$ respectively, and $D' \triangleq \hat C' - C'$ is the reduced perturbation. In general, to bound the error of the $k$th principal component, the first $k-1$ rank-one projections of $D$ onto $v_i v_i^T$ may be projected out with a loss of accuracy of only $O(\|D\|^2)$. As the reduced perturbation $D'$ has a smaller norm, we are able to obtain better bounds.

With equation (2.20), we continue like we did with the first principal component, replacing $D$ with $D'$ in our analysis. Equation (2.6) is modified to give

$$\begin{aligned}
E\|\hat v_2 - v_2\|_2 &\le \left\{\sum_{j \ne 2}\frac{1}{(\lambda_j - \lambda_2)^2}\right\}^{1/2} E\left\{\sum_{j \ne 2}(v_j^T D'\, v_2)^2\right\}^{1/2} \\
&\le \left\{\sum_{j \ne 2}\frac{1}{(\lambda_j - \lambda_2)^2}\right\}^{1/2} E\|D'\| \\
&\le \left(\frac{2}{n-1}\,\|P'\|\, E\|E\| + O(\|E\|^2)\right)\left\{\sum_{j \ne 2}\frac{1}{(\lambda_j - \lambda_2)^2}\right\}^{1/2}. \tag{2.21}
\end{aligned}$$

Here $P'$ is the reduced data, found by forming the singular value decomposition of $P$ without the first component, i.e., $\sum_{i>1}\sigma_i v_i w_i^T$, and row mean centering the result. By construction, $P'P'^T$ and $PP^T$ share the same eigenvectors.
Now consider the two terms in equation (2.21). The first term in brackets depends on $\|P'\|$ and $E\|E\|$. Equation (2.14) describing $E\|E\|$ is unchanged, but equation (2.10) is now replaced with

$$\|P'\|^2 = (n-1)\lambda_2. \tag{2.22}$$

The second term in brackets is

$$\left\{\sum_{j \ne 2}\frac{1}{(\lambda_j - \lambda_2)^2}\right\}^{1/2} = \left\{\sum_{\substack{1 \le j \le n \\ j \ne 2}}\frac{1}{(\lambda_j - \lambda_2)^2} + \sum_{j > n}\frac{1}{\lambda_2^2}\right\}^{1/2} \le \left\{\frac{n-1}{\min_j(\lambda_2 - \lambda_j)^2} + \frac{g-n}{\lambda_2^2}\right\}^{1/2}.$$

Now using the assumption that $P$ is well-separated, this inequality is approximately

$$\left\{\sum_{j \ne 2}\frac{1}{(\lambda_j - \lambda_2)^2}\right\}^{1/2} \approx \left\{\frac{g-n}{\lambda_2^2}\right\}^{1/2}. \tag{2.23}$$

Substituting equation (2.22) and equation (2.23) into equation (2.21) and dropping higher order terms, we have

$$E\|\hat v_2 - v_2\|_2 \le \frac{\beta}{\sqrt{n N \lambda_2}}.$$

This technique can be applied iteratively. For the $k$th principal component, the first $k-1$ rank-one projections of $D$ onto $v_i v_i^T$ can be neglected, so that the norm of the reduced data $P'$ is $\{(n-1)\lambda_k\}^{1/2}$. Hence the error in the $k$th principal component is

$$E\|\hat v_k - v_k\|_2 \le \frac{\beta}{\sqrt{n N \lambda_k}}. \tag{2.24}$$

2.2 Gene expression modules enhance principal value separation in a simple model
In the main text we show that principal value separation determines the accuracy of tran-
scriptional program extraction at low read-depths. Through a broad survey of gene expres-
sion datasets, we find that favorable principal value separation is common in biological data
allowing mRNA-seq at a drastically increased scale. In this section, we study a simple gene
expression model to ask how modularity, a core structural property of biological systems,
might impart principal value separation and noise tolerance to gene expression data. The
relationship between gene expression modules and principal value separation is of funda-
mental interest because transcriptional regulatory networks are commonly organized into
regulatory modules, groups of covarying genes. In fact the covariance matrices of both the
Shen et al. and Zeisel et al. datasets contain coherent gene expression blocks that suggest
an underlying modular architecture (Figure 4B and S4). Within the context of a simple
model, we rigorously show that principal value separation is enhanced by such a modular
architecture; principal value separation scales directly with module size and the magnitude of
gene expression covariance within modules. As such, gene expression modules might endow
biological systems with an inherent tolerance to shallow profiling.

Gene expression model. We consider a gene expression covariance matrix, $C$, that has the block diagonal structure shown in Figure S3B. The matrix contains two blocks of size $b \times b$, and each block represents a module of covarying genes. Genes within each red block have a positive covariance, and the two blocks have relative negative covariance represented by green. For mathematical convenience, we consider all blocks to share a constant within-block gene expression covariance $q$. The gene expression variances $m_i$ within each block are assumed to be constant, but differ between the blocks to avoid module degeneracy. The between-block covariances are likewise constant and equal to $r$. We note that for a gene expression model with two mutually exclusive gene expression modules (i.e., when module 1 is on, module 2 is off), $r \le 0$ because covariance is calculated following mean centering of the raw gene expression data. We also assume that $q \ge 0$, $m_1 > m_2$, $m_1 > q$, $m_2 > q$. Finally, we assume that $q > |r|$ to ensure that the two blocks of genes are distinct.

Solution to the model. In this simple model, we now determine the factors that influence the separation of the principal values of the underlying data. The principal values, denoted by $\lambda_i$, are defined as the eigenvalues of $C$ and for this model can be calculated analytically. There are $2b$ eigenvalues in total: $b-1$ eigenvalues equal to $m_1 - q$, another $b-1$ eigenvalues equal to $m_2 - q$, and two eigenvalues given by

$$\{\lambda_1, \lambda_2\} = \bar m + (b-1)q \pm \sqrt{(br)^2 + \{\Delta m\}^2}, \tag{2.25}$$

with the notation $\bar m = (m_1 + m_2)/2$ and $\Delta m = (m_1 - m_2)/2$. When $q$ and $r$ both equal zero, all gene-gene covariances are zero and the system has eigenvalues $m_1$ and $m_2$, both with multiplicity $b$. For nonzero $q$ and $r$, the non-degenerate eigenvalues separate from the degenerate eigenvalues. In this case, $\lambda_1$ is the largest eigenvalue of $C$ and, for sufficiently large $q$, $\lambda_2$ is the second largest eigenvalue of $C$ (Figure S3C). We note that the noise tolerance of $\lambda_1$ is of special interest because its associated eigenvector, the first principal component of the gene expression data, identifies the gene expression modules in the system (Figure S3B). The entries of this eigenvector, $pc_1$, are positive for genes within one module and negative for genes within the other module.

As described in the main text, the noise tolerance of $pc_1$ depends upon the spacing between $\lambda_1$ and all other eigenvalues of $C$. These can be calculated directly as

$$\lambda_1 - \lambda_2 = 2\sqrt{(br)^2 + \{\Delta m\}^2} \tag{2.26a}$$
$$\lambda_1 - (m_1 - q) = \sqrt{(br)^2 + \{\Delta m\}^2} - \Delta m + bq \tag{2.26b}$$
$$\lambda_1 - (m_2 - q) = \sqrt{(br)^2 + \{\Delta m\}^2} + \Delta m + bq. \tag{2.26c}$$

These quantities are depicted as a function of $q$ in Figure S3C for a two-module, four gene system.
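The eigenvalue formulas (2.25)–(2.26) are easy to check numerically. The sketch below (assuming numpy, with illustrative parameter values chosen to satisfy the model's constraints) builds the two-module covariance matrix of Figure S3B and compares its spectrum with the analytical result:

```python
import numpy as np

def block_covariance(b, q, r, m1, m2):
    """Two-module covariance matrix: b x b blocks with within-module
    covariance q, between-module covariance r, and variances m1, m2."""
    C = np.full((2 * b, 2 * b), float(r))
    C[:b, :b] = q
    C[b:, b:] = q
    np.fill_diagonal(C[:b, :b], m1)
    np.fill_diagonal(C[b:, b:], m2)
    return C

b, q, r, m1, m2 = 5, 4.0, -1.0, 10.0, 6.0   # satisfies q > |r|, m1 > m2 > q
lam = np.sort(np.linalg.eigvalsh(block_covariance(b, q, r, m1, m2)))[::-1]

mbar, dm = (m1 + m2) / 2, (m1 - m2) / 2
lam_12 = mbar + (b - 1) * q + np.array([1, -1]) * np.hypot(b * r, dm)
print(lam[:2], lam_12)  # non-degenerate eigenvalues match equation (2.25)
print(lam[2:])          # m1 - q and m2 - q, each with multiplicity b - 1
```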

The features that improve the noise tolerance of principal component recovery. Examination of the principal value separations leads to two conclusions. First, increasing the within-module covariance $q$ increases the separation between $\lambda_1$ and $\lambda_i$ for $i > 2$ (equations 2.26b, 2.26c). Second, this effect scales with the size of the gene expression modules. While both the covariance term $q$ and the variance term $m_i$ contribute to principal value separation, only the impact of the covariance term scales with block size $b$ (as $bq$). Hence for large modules (i.e., large $b$), the covariance terms may contribute significantly to principal value separation. We conclude that gene expression modularity directly increases the separation between $\lambda_1$ and all other principal values, and thus enhances the ability to extract principal component 1 at low read depth. (We note that our model with covariance $q$ fixed for all blocks can be generalized to allow for differing covariance parameters within blocks and still yields qualitatively similar results.)

Finally, the impact of module size on principal value separation can be seen directly in a generalized model where more than two gene expression modules are allowed. In Figure S3A, we analyze a series of covariance matrices where the number of genes is held constant, but the number of gene expression modules is increased. In these calculations, the covariance terms $q$ and $r$ are held constant ($q = 40$, $r = -8$) and the $m_i$ span a constant range, $80 < m_i < 200$. For constant $q$ and $r$, the spacing between the largest eigenvalue and all other eigenvalues ($\lambda_1 - \lambda_i$) increases as module size increases. A system with two modules has significantly increased principal value separation when compared with even a four module system. Due to this scaling, large gene expression modules might significantly enhance principal value separation and therefore noise tolerance in biological data.

3 Supplemental Experimental Procedures


Alignment of sequencing reads and quantification of read counts for public
mRNA-seq datasets
Raw mRNA-seq reads were obtained from the Gene Expression Omnibus. mRNA-seq
datasets used in Figure S5A are from Shen et al. 2012; Treutlein et al. 2014; Shalek et al. 2013;
Kumar et al. 2014; Chen et al. 2012; Pollen et al. 2014. The reads from these studies were
aligned to either human hg19 or mouse mm9 exomes. Exome files were constructed based
upon the transcriptome annotations and gene feature files (gff) available from the UCSC
genome browser (Kent et al. 2002). Open reading frames encoding rDNA, transposable
elements, or other non-protein coding features were not included in the exome. Reads were
aligned to exomes using Bowtie2 v 2.1.0 (Langmead and Salzberg 2012) with the following
options: -D 25 -R 3 -N 1 -L 20 -i S,1,0.50 --local.

Following alignment, the data was preprocessed prior to analysis. This was accomplished
by normalizing raw per gene read counts by the total number of reads collected for a given
sample, ensuring that the normalized reads of one experiment sum to one.

For the analysis of Zeisel et al., we used the transcript counts reported on the Linnarsson
Lab website.

Simulated shallow sequencing through down-sampling of reads


A computational downsampling procedure was applied to simulate the impact of reduced
read depth on public mRNA-seq datasets (the Zeisel et al. data required a different method;
see the next experimental procedure). Read counts for the deep mRNA-seq experiments
were normalized by dividing the number of reads mapped to a gene by the total number of
mapped reads in that experiment, generating a multinomial probability mass function. For
a given simulated read depth, we model the sequencing process by drawing N reads, with
replacement, from this multinomial distribution.

We sample with replacement because the number of molecules within an mRNA-seq library, $\sim 10^{12}$ (McIntyre et al. 2011), is much larger than the number of reads being sequenced, $\sim 10^{7}$ (Shen et al. 2012), effectively making each sequencing event independent of the others. To accelerate the computation at simulated depths over one million reads, read counts were estimated directly with a Poisson distribution. Similar downsampling procedures are frequently used to model read depth reductions and associated measurement noise (Robinson and Storey 2014; Pollen et al. 2014).
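A minimal version of this downsampling procedure is sketched below (consistent with the description above, assuming numpy; `counts` is one sample's deep read count vector):

```python
import numpy as np

rng = np.random.default_rng(0)

def downsample(counts, N):
    """Simulate a shallow experiment with N reads by drawing, with
    replacement, from the multinomial defined by the deep counts.
    Above one million reads, a Poisson approximation is used."""
    p = counts / counts.sum()
    if N > 1_000_000:
        return rng.poisson(N * p)
    return rng.multinomial(N, p)
```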

Simulated shallow sequencing through down-sampling of transcripts for Zeisel


et al. dataset
As the Zeisel et al. data contains unique molecular identifiers (UMIs) which allow for
the direct quantification of transcripts, the previously described downsampling procedure
was modified for this dataset. First, 15,000 transcripts were sampled with replacement for
each cell (as previously described) to obtain gene expression profiles with constant transcript
coverage per sample. We sampled 15,000 transcripts as that is approximately the average
number of unique transcripts observed per cell in Zeisel et al. To simulate low coverage data,
we sampled a desired number of reads from this reference distribution without replacement.

Saturating expression levels of outlying genes

Following downsampling, the largest 1% of all gene expression values (based on read
counts) were set to the value of the 99th percentile of the data. This saturation was performed
to diminish the impact of extreme outliers on subsequent data analysis as PCA is known to
be sensitive to such outliers. Following saturation, data was renormalized to preserve the
equal weighting of each experiment. We found in practice that outlier filtering was important
for preserving biological structure and in fact was required for biological replicates to cluster
together in the mouse tissue dataset. The saturation threshold for Kumar et al. 2014 was an
exception to the 1% threshold; it was set to 2.25% to ensure biological replicates clustered
together. Read counts were used as the fundamental gene expression unit in the analysis
for simplicity in theoretical modeling and convenience during the simulated down-sampling
procedure. Similar results were obtained in FPKM units where read counts are normalized
for gene length.

For the Zeisel et al. dataset, after downsampling and before principal components analy-
sis, we removed the top 15 varying genes. We found that this was necessary for recapitulating
the original study's classification of cell types at full transcript coverage.
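The saturation step amounts to percentile clipping followed by renormalization; a sketch (assuming numpy; X is a genes-by-samples matrix of normalized counts):

```python
import numpy as np

def saturate(X, pct=99.0):
    """Cap the largest expression values at the given percentile of the
    whole matrix, then renormalize each sample (column) to sum to one."""
    cap = np.percentile(X, pct)
    Xs = np.minimum(X, cap)
    return Xs / Xs.sum(axis=0)
```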

Evaluation of Equation 1

Evaluation of Equation 1 requires the deep principal components, deep principal values, and the deep and shallow data covariance matrices. The deep principal components and principal values were determined for each dataset directly from the deep normalized read count data. $\hat C$ was then calculated on read count data generated through the simulated down-sampling procedure described above. At each read depth, Equation 1 was evaluated on twenty separate instances of $\hat C$, and the mean principal component error was reported as a percent of the theoretical maximum error ($\sqrt{2}$).
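The reported error metric itself is simple to state in code; a sketch (assuming numpy; resolving the sign ambiguity between deep and shallow components before taking the distance is an implicit step made explicit here):

```python
import numpy as np

def pc_error_percent(v_deep, v_shallow):
    """Principal component error as a percent of the theoretical
    maximum distance sqrt(2) between sign-aligned unit vectors."""
    v_shallow = v_shallow * np.sign(v_deep @ v_shallow)
    return 100 * np.linalg.norm(v_deep - v_shallow) / np.sqrt(2)
```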

Projecting gene expression profiles onto the principal components
Classification plots show the principal component coefficients for each gene expression
profile (from either a bulk mRNA-seq sample or single cell). These coefficients represent the
amount of variance along the axis defined by the respective principal component, or the pro-
jection of the expression profile onto a principal component. These coefficients are computed
by taking the dot product of the gene expression profile and the principal component. When
simulating low coverage mRNA-seq, the noisy, simulated gene expression profile is projected
onto the principal components computed from the noisy, simulated gene expression data.

Zeisel et al. cell type classification accuracy


For classification of single cells from Zeisel et al. at a simulated depth, each sampled
transcriptional profile was compared to three reference transcriptional profiles. The reference
transcriptional profiles were computed by averaging the full depth transcriptional profile of
each cell type as classified by Zeisel et al. Each downsampled cell was then assigned the cell
type label of the most similar reference profile. False positives correspond to mismatches
between the assigned cell type and the cell type from Zeisel et al. at full depth.
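A sketch of this nearest-reference classifier follows (assuming numpy; Pearson correlation is used here as the similarity measure, an assumption since the text does not fix the metric):

```python
import numpy as np

def classify_cells(cells, references):
    """Assign each downsampled cell (column of `cells`) the label of the
    most similar full-depth reference profile (column of `references`)."""
    labels = []
    for x in cells.T:
        sims = [np.corrcoef(x, ref)[0, 1] for ref in references.T]
        labels.append(int(np.argmax(sims)))
    return np.array(labels)
```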

Cell type classification by nonlinear dimensionality reduction

Figures S2K and S2L were generated by downsampling data from Kumar et al. and
Shalek et al. as described above, followed by dimensionality reduction with t-SNE and LLE.
t-SNE was applied through the scikit-learn package in Python (version 2.7.6) (Pedregosa et al.
2011), and LLE was implemented following Roweis and Saul 2000.

Simple gene expression model


Simulated covariance matrices were generated for a system with six gene expression modules. Module sizes were drawn from an exponential distribution, and modules were sorted by increasing size. Within-module covariance was set to a uniform value ($q$, ranging from 10 to 80), and between-module covariance ($r$) was set to a constant for simplicity (−10).

Analysis and simulated downsampling of microarray data


Microarray data were downloaded from Gene Expression Omnibus (Edgar, Domrachev,
and Lash 2002). To minimize the effect of platform variability, one type of microarray
platform was selected for each species. We chose Affymetrix Yeast Genome S98 Array,
Affymetrix Mouse Genome 430 2.0 Array, and Affymetrix Human Genome U133 Plus 2.0
Array because they had the largest number of datasets containing at least 20 samples. Log-
transformed datasets were removed to ensure that each dataset was preprocessed in the
same way. After filtering, 20 datasets for Saccharomyces cerevisiae, 106 datasets for Mus
musculus and 226 for Homo sapiens remained. Each dataset was normalized so that the
gene expression values of each sample sum to one. Further, as with the mRNA-seq datasets,
we saturate expression levels at the 99th percentile to handle extreme expression outliers. To
generate simulated mRNA-seq data from microarray experiments, we used the normalized
and filtered gene expression matrix as input into our down-sampling procedure previously

described. The normalized gene expression matrix was used as a multinomial probability
distribution and this distribution was sampled to generate simulated mRNA-seq data at
different read depths.

Equation 2: fitting the constant $\beta$ and cross validation

To fit $\beta$ in Equation 2, we first partitioned the microarray data into a training set and a cross validation set, each consisting of half the datasets. We simulated shallow sequencing on the microarray data within the training set at ten values between $10^3$ and $10^7$ reads to obtain PCA error for each dataset. The constant $\beta$ from Equation 2 was determined by fitting (with a linear regression) the simulated PCA error to the analytical prediction of Equation 2.

To demonstrate that the relationship is predictive, we simulated shallow sequencing on the remaining 50% of the microarray datasets at all ten depths and compared the predicted values with those observed from simulation. We further used the four mRNA-seq datasets for mouse and human (which were not used for fitting) as additional cross-validation. For each species, only one value of $\beta$ is calculated globally, and this value is used for all principal components, read depths, and datasets (for humans, $\beta = 71.25$ and for mice, $\beta = 69.30$).
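The fit reduces to a one-parameter regression of observed error on the predictor $1/\sqrt{\lambda_k n N}$; a sketch (assuming numpy, and a fit through the origin, which is an assumption since the text specifies only a linear regression):

```python
import numpy as np

def fit_beta(errors, lam_k, n, N):
    """Least-squares estimate of beta in Equation 2 from flat arrays of
    observed PCA errors and the matching (lam_k, n, N) values."""
    x = 1.0 / np.sqrt(np.asarray(lam_k) * np.asarray(n) * np.asarray(N))
    errors = np.asarray(errors)
    return float(x @ errors / (x @ x))
```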

Principal values and error in non-negative matrix factorization


Non-negative matrix factorization (NMF) was performed on normalized read count matrices generated from the microarray database. Three "deep" NMF vectors were computed from the original data, and three "shallow" NMF vectors were computed from simulated mRNA-seq data with 45,000 reads. The pair of computations shared the same random initialization. Each shallow NMF vector from the simulated shallow mRNA-seq data was matched with a corresponding deep vector. Our algorithm determined matches by finding the one-to-one mapping that minimized the summed squared differences between the deep and corresponding shallow NMF vectors. The normalized error was computed as the magnitude of the difference between the matched NMF parts divided by the magnitude of the deep NMF part. The sum of the normalized errors was computed for three different initialization states. The median of the summed errors was used in Figure S5B.
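A sketch of the matching step (assuming numpy, scipy, and scikit-learn; the Hungarian algorithm is used here as a convenient way to find the error-minimizing one-to-one mapping, and matrices are assumed oriented genes by samples):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.decomposition import NMF

def nmf_error(X_deep, X_shallow, k=3, seed=0):
    """Summed normalized error between matched deep and shallow NMF parts."""
    W_deep = NMF(n_components=k, init='random', random_state=seed).fit_transform(X_deep)
    W_shal = NMF(n_components=k, init='random', random_state=seed).fit_transform(X_shallow)
    # cost[i, j]: summed squared difference between deep part i, shallow part j
    cost = ((W_deep[:, :, None] - W_shal[:, None, :]) ** 2).sum(axis=0)
    rows, cols = linear_sum_assignment(cost)
    return sum(np.linalg.norm(W_deep[:, i] - W_shal[:, j]) / np.linalg.norm(W_deep[:, i])
               for i, j in zip(rows, cols))
```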

Supplemental References
Anders, S. and Huber, W. (2010). Differential expression analysis for sequence count data. Genome Biology 11.10.
Brennecke, P. et al. (2013). Accounting for technical noise in single-cell RNA-seq experiments. Nature Methods 10.11, pp. 1093–1095.
Chen, R. et al. (2012). Personal Omics Profiling Reveals Dynamic Molecular and Medical Phenotypes. Cell 148.6, pp. 1293–1307.
Daley, T. and Smith, A. D. (2014). Modeling genome coverage in single-cell sequencing. Bioinformatics 30.22, pp. 3159–3165.
Ding, B. et al. (2015). Normalization and noise reduction for single cell RNA-seq experiments. Bioinformatics 31.13, pp. 2225–2227.
Edgar, R., Domrachev, M., and Lash, A. E. (2002). Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Research 30.1, pp. 207–210.
Grün, D., Kester, L., and van Oudenaarden, A. (2014). Validation of noise models for single-cell transcriptomics. Nature Methods 11.6, pp. 637–640.
Islam, S. et al. (2014). Quantitative single-cell RNA-seq with unique molecular identifiers. Nature Methods 11.2, pp. 163–166.
Kent, W. J. et al. (2002). The Human Genome Browser at UCSC. Genome Research 12.6, pp. 996–1006.
Kumar, R. M. et al. (2014). Deconstructing transcriptional heterogeneity in pluripotent stem cells. Nature 516.7529, pp. 56–61.
Langmead, B. and Salzberg, S. L. (2012). Fast gapped-read alignment with Bowtie 2. Nature Methods 9.4, pp. 357–359.
Liu, Y., Zhou, J., and White, K. P. (2014). RNA-seq differential expression studies: more sequence or more replication? Bioinformatics 30.3, pp. 301–304.
Marioni, J. C. et al. (2008). RNA-seq: an assessment of technical reproducibility and comparison with gene expression arrays. Genome Research 18.9, pp. 1509–1517.
McIntyre, L. M. et al. (2011). RNA-seq: technical variability and sampling. BMC Genomics 12.1, p. 293.
Pedregosa, F. et al. (2011). Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12, pp. 2825–2830.
Pollen, A. A. et al. (2014). Low-coverage single-cell mRNA sequencing reveals cellular heterogeneity and activated signaling pathways in developing cerebral cortex. Nature Biotechnology 32.10, pp. 1053–1058.
Robinson, D. G. and Storey, J. D. (2014). subSeq: determining appropriate sequencing depth through efficient read subsampling. Bioinformatics 30.23, pp. 3424–3426.
Roweis, S. T. and Saul, L. K. (2000). Nonlinear Dimensionality Reduction by Locally Linear Embedding. Science 290.5500, pp. 2323–2326.
Shalek, A. K. et al. (2013). Single-cell transcriptomics reveals bimodality in expression and splicing in immune cells. Nature 498.7453, pp. 236–240.
Shankar, R. (2012). Principles of Quantum Mechanics. Springer Science & Business Media.
Shen, Y. et al. (2012). A map of the cis-regulatory sequences in the mouse genome. Nature 488.7409, pp. 116–120.
Shiroguchi, K. et al. (2012). Digital RNA sequencing minimizes sequence-dependent bias and amplification noise with optimized single-molecule barcodes. Proceedings of the National Academy of Sciences 109.4, pp. 1347–1352.
Stewart, G. W. and Sun, J. (1990). Matrix Perturbation Theory. Academic Press.
Tarazona, S. et al. (2011). Differential expression in RNA-seq: A matter of depth. Genome Research 21.12, pp. 2213–2223.
Treutlein, B. et al. (2014). Reconstructing lineage hierarchies of the distal lung epithelium using single-cell RNA-seq. Nature 509.7500, pp. 371–375.
Tropp, J. A. (2011). User-Friendly Tail Bounds for Sums of Random Matrices. Foundations of Computational Mathematics 12.4, pp. 389–434.
Vallejos, C. A., Marioni, J. C., and Richardson, S. (2015). BASiCS: Bayesian Analysis of Single-Cell Sequencing Data. PLoS Comput Biol 11.6, e1004333.

