Review
Publisher: Kernel Press. Copyright © (2018) the Author(s). This is an Open Access article distributed under the
terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits
unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
GENOMICS AND COMPUTATIONAL BIOLOGY Vol. 4 No. 2 (2018): e100041
ASSUMPTIONS OF PCA
Gene expression levels have been shown to be heteroscedastic, meaning that their variance changes depending on mean expression level. Usually, genes with lower expression levels are associated with higher variability across different measurements [8]. The Poisson distribution, which is often applied to modelling count data, tends to underestimate the variance in gene expression data [8]. In contrast, the negative binomial (NB) distribution has been shown to provide a good fit when modelling differential gene expression [9]. The NB model assumes that an observation has a population mean $\mu$ and variance $\sigma^2 = \mu + \phi\mu^2$, where $\phi$ is the so-called dispersion parameter [9]. To account for the heteroscedasticity of the input data, we estimated the dispersion parameter for each gene with the help of the DESeq R package. Afterwards, we applied the varianceStabilizingTransformation() function to the input data in order to produce a data matrix in which expression levels are homoscedastic. Finally, we extracted the principal components from the transformed data using the plotPCA() function. The plotPCA() function internally calls on prcomp(), meaning that PCA was performed by singular value decomposition of

PC1 and PC2 together accounted for only 24.03% of the total variance, which means that additional components are important and may be associated with multivariate patterns. However, our goal here is not to provide a complete analysis of the gene expression data set, but rather to demonstrate how PCA can be applied to this type of data for exploratory purposes. Since a multivariate pattern already emerged based on PC1 and PC2, we did not further evaluate the remaining components.

The R script used to perform PCA on the gene expression data is included in Supplementary File 2.

ACKNOWLEDGEMENTS

The work of SG and DF was funded by the Center for Computational Sciences in Mainz (CSM). The work of HT was funded by Fresenius Kabi Deutschland GmbH.
Figure 3: PCA analysis on gene expression data. PCA was performed on transformed gene expression data from a mouse
model. PCA identified two distinct clusters in the data separated along the first principal component. Different colouring schemes
revealed that clustering was associated with the mouse line (a). Gender was not associated with a pattern in the data (b), but
genotype revealed a gradient of discrimination along the second principal component (c). The amount of variance in percent
accounted for by each PC is included in brackets.
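
The percentage of variance attributed to each PC, as reported in brackets in a plot of this kind, can be derived from the standard deviations of the principal components (the sdev element returned by R's prcomp()). The following is a minimal Python sketch of that calculation only. The function name percent_variance and the input values are hypothetical illustrations; they are not taken from the study's data or from its R script.

```python
def percent_variance(sdev):
    """Return the percentage of total variance explained by each component.

    `sdev` holds the standard deviation of each principal component,
    so each component's variance is its standard deviation squared.
    """
    variances = [s * s for s in sdev]
    total = sum(variances)
    if total == 0:
        raise ValueError("total variance is zero; percentages are undefined")
    return [100.0 * v / total for v in variances]


# Hypothetical component standard deviations (NOT values from the study).
example_sdev = [2.0, 1.0, 1.0]
percentages = percent_variance(example_sdev)

# Cumulative share of the first two components, analogous to reporting
# the variance accounted for by PC1 and PC2 together.
pc1_pc2_share = percentages[0] + percentages[1]
print([round(p, 2) for p in percentages], round(pc1_pc2_share, 2))
```

When the first two components together account for only a small share of the total, as in the analysis described here, the remaining components may still carry structure worth inspecting.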
script used to process the gene expression data and perform PCA.

ABBREVIATIONS

F: female
GEO: gene expression omnibus

6. Horn JL. A rationale and test for the number of factors in factor analysis. Psychometrika. 1965;30(2):179–185. doi:10.1007/BF02289447.
7. Maung R, Hoefer MM, Sanchez AB, Sejbuk NE, Medders KE, Desai MK, et al. CCR5 Knockout Prevents Neuronal Injury and Behavioral Impairment Induced in a Transgenic Mouse Model by a CXCR4-Using HIV-1 Glycoprotein 120. The Journal of Immunology. 2014;193(4):1895–1910. doi:10.4049/jimmunol.1302915.
8. Anders S, McCarthy D, Chen Y, Okoniewski M, Smyth G, Huber W,