
Multivariate Data Analysis and Visualization

Tools for Understanding Biological Data


Dmitry Grapov
Introduction: Systems
Figure (labels): emergent, graph theory, complex systems, informatics, modeling, physiology, biochemistry, systems, chemical analysis, deterministic, reductionist


Oltvai, et al. Science 25 October 2002: 763-764.


Introduction: Inference
Overview
Types:            Univariate    Bivariate       Multivariate
                  1-D           2-D             n-D
Properties:       vector        matrix          matrix
Representations:  histograms    scatter plots   dendrograms
                  densities                     heatmaps
                                                biplots
                                                networks
Central Idea:     mean          correlation     many

http://www.thefullwiki.org/Hypercube
Univariate: Properties
vector of length m
mean
variance
Univariate: Representations
Univariate: Assumptions

Normality
Univariate: Utility
Hypothesis testing
α - type I error (false positive)
β - type II error (false negative)
power - (1 - β)
effect size - standardized difference in mean
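The standardized difference in means can be sketched as Cohen's d with a pooled standard deviation. This is a minimal illustration using only the Python standard library; the two groups are made-up numbers, not data from this talk.

```python
import math
import statistics


def cohens_d(a, b):
    """Standardized difference in means using the pooled standard deviation."""
    na, nb = len(a), len(b)
    va, vb = statistics.variance(a), statistics.variance(b)  # sample variances
    pooled_sd = math.sqrt(((na - 1) * va + (nb - 1) * vb) / (na + nb - 2))
    return (statistics.mean(a) - statistics.mean(b)) / pooled_sd


# Hypothetical measurements for two groups (illustrative only)
control = [4.0, 5.0, 6.0, 5.0, 5.0]
treated = [6.0, 7.0, 8.0, 7.0, 7.0]

d = cohens_d(treated, control)
print(f"effect size d = {d:.2f}")
```

A larger d means the group means are further apart relative to the within-group spread, which raises test power at a fixed sample size.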
Univariate: Limitations
Biological definition of the mean?
Relationship between sample size and test power
Multiple hypothesis testing
False discovery rate
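One common way to control the false discovery rate across many tests is the Benjamini-Hochberg adjustment. The slide does not name a specific procedure, so treat this as an assumed example; the p-values are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotonic
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values from four univariate tests
pvals = [0.01, 0.04, 0.03, 0.20]
qvals = benjamini_hochberg(pvals)
print(qvals)
```

Variables whose adjusted value falls below the chosen FDR level (for example 0.05) would be reported as significant.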
Old Faithful Data
272 observations
time between eruptions: 70 ± 14 min
duration of eruption: 3.5 ± 1 min

Azzalini, A. and Bowman, A. W. (1990). A look at some data on the Old Faithful geyser. Applied Statistics 39, 357-365.
Bivariate: Properties
Matrix of 2 vectors of length m
Bivariate: Representations

(X,Y)
Bivariate: Utility
bivariate distribution
correlation

(X,Y)

Variable 2 = m*Variable 1 + b
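The correlation and the line Variable 2 = m*Variable 1 + b can be computed directly from sums of squares. A minimal sketch in plain Python, assuming ordinary least squares for the line; the x and y values are illustrative only.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length vectors."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def fit_line(x, y):
    """Least-squares slope m and intercept b for y = m*x + b."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    m = sxy / sxx
    return m, my - m * mx


# Hypothetical paired measurements
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.0]

m, b = fit_line(x, y)
print(f"r = {pearson_r(x, y):.3f}, Variable 2 = {m:.2f}*Variable 1 + {b:.2f}")
```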
Bivariate: Limitations

Correlation coefficients measure only linear or monotonic relationships

http://en.wikipedia.org/wiki/Correlation
Bivariate: Limitations

Sensitive to outliers

http://en.wikipedia.org/wiki/Correlation
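Both limitations can be seen by comparing the Pearson coefficient (linear) with the Spearman rank coefficient (monotonic). The sketch below is plain Python with made-up data: one monotonic but non-linear series and one linear series with a single outlier.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient (linear association)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson_r(ranks(x), ranks(y))


x = [1, 2, 3, 4, 5, 6]
y_monotonic = [v ** 3 for v in x]      # monotonic, not linear
y_outlier = [1, 2, 3, 4, 5, -20]       # linear except for the last point

print("monotonic:", pearson_r(x, y_monotonic), spearman_rho(x, y_monotonic))
print("outlier:  ", pearson_r(x, y_outlier), spearman_rho(x, y_outlier))
```

In this toy case the single outlier flips the sign of the Pearson coefficient, while the rank-based coefficient is reduced but stays positive.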
Old Faithful

Azzalini, A. and Bowman, A. W. (1990). A look at some data on the Old Faithful geyser. Applied Statistics 39, 357-365.
Old Unfaithful?
Additional variables: nearby hydrofracking
Improve inference based on more information
Multivariate: Properties
A matrix of n vectors of length m

Challenges
data are often wide (more variables than samples)
integration
noise

Rewards
robust inference
signal amplification
holistic/systems approach
Correlation matrix
Multivariate: Dimensional Reduction
Principal Components Analysis (PCA)

Linear n-dimensional encoding of original data

Where dimensions are:
1. orthogonal (uncorrelated)
2. ordered by variance explained, so the top k dimensions capture the most variance

Figure axes: PC 1, PC 2
Multivariate: Dimensional Reduction
Calculating PCs: singular value decomposition (SVD)
The original data are decomposed into three matrices:
Scores (n x PC): sample representation based on all variables
Eigenvalues (PC x PC): explained variance
Loadings (m x PC): variable contribution to scores

Wall, Michael E., Andreas Rechtsteiner, Luis M. Rocha. "Singular value decomposition and principal component analysis". In A Practical Approach to Microarray Data Analysis. D.P. Berrar, W. Dubitzky, M. Granzow, eds. pp. 91-109, Kluwer: Norwell, MA (2003). LANL LA-UR-02-4001.
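The decomposition into scores, explained variance and loadings can be sketched with NumPy's `numpy.linalg.svd`. The data here are synthetic (random numbers with one induced correlation), and the choice of n = 50 samples by m = 4 variables is an assumption made only for illustration. Note that the SVD returns singular values; the eigenvalues of the covariance matrix are their squares divided by n - 1.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: n = 50 samples (rows), m = 4 variables (columns)
X = rng.normal(size=(50, 4))
X[:, 1] += 2.0 * X[:, 0]  # make two variables correlated

Xc = X - X.mean(axis=0)  # center each variable
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

scores = U * s                               # n x PC: sample coordinates
loadings = Vt.T                              # m x PC: variable contributions
eigenvalues = s ** 2 / (X.shape[0] - 1)      # variance along each PC
explained = eigenvalues / eigenvalues.sum()  # fraction of variance explained

print("variance explained per PC:", np.round(explained, 3))
```

Multiplying the scores by the transposed loadings reconstructs the centered data, and keeping only the first k columns of each gives the rank-k approximation.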
Multivariate: Representations
Old Faithful 2.0
272 measurements on 8 variables
2 real, 6 random noise
Multivariate: Representation

Number of PCs can be used to assess true data complexity
Identify outliers using all measurements
Identify interesting groups
Evaluate uni- and bivariate observations
Use known values to impute missing ones
PCA: Considerations
data pre-treatment: no pre-treatment vs. centered and scaled to unit variance
outliers
noise
unsupervised projection
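Centering and scaling to unit variance (often called autoscaling) puts variables measured on different scales on an equal footing before PCA. A minimal sketch with the Python standard library; the two variables are hypothetical values chosen only to differ in scale.

```python
import statistics


def autoscale(columns):
    """Center each variable to mean 0 and scale it to unit variance."""
    scaled = []
    for col in columns:
        mu = statistics.mean(col)
        sd = statistics.stdev(col)  # sample standard deviation
        scaled.append([(v - mu) / sd for v in col])
    return scaled


# Two hypothetical variables on very different scales
variables = [
    [120.0, 340.0, 95.0, 410.0],
    [0.8, 1.9, 0.5, 2.4],
]
scaled = autoscale(variables)

for col in scaled:
    print(round(statistics.mean(col), 6), round(statistics.stdev(col), 6))
```

Without this step, the variable with the largest raw variance tends to dominate the first principal component.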
PCA: Considerations
data pre-treatment
outliers
linear reconstruction
noise
unsupervised projection
Independent components analysis (ICA)
Use ICA to calculate statistically independent components
PCA: Considerations
data pre-treatment
outliers
linear reconstruction
noise
supervised projection
Non-negative matrix factorization (NMF)
NMF uses an additive, parts-based encoding

Learning the parts of objects by nonnegative matrix factorization, D.D. Lee, H.S. Seung, Zhipeng Zhao, ppt.
PCA: Considerations
data pre-treatment
outliers
linear reconstruction
noise
supervised projection
Identify projection correlated with class assignment (classification) or continuous variables (regression)

Partial Least Squares Projection to Latent Structures (PLS/-DA)
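A supervised projection of this kind can be sketched with the NIPALS algorithm for a single response (PLS1). This is one common formulation and not necessarily the implementation behind these slides; the function name `pls1_fit` and the synthetic data are assumptions. For PLS-DA the same fit would be applied to a 0/1 class indicator in place of the continuous y.

```python
import numpy as np


def pls1_fit(X, y, n_components):
    """Single-response PLS by NIPALS; returns coefficients for centered data."""
    X = X - X.mean(axis=0)
    y = y - y.mean()
    W, P, q = [], [], []
    for _ in range(n_components):
        w = X.T @ y
        w /= np.linalg.norm(w)   # weights: direction of maximal covariance with y
        t = X @ w                # scores for this latent variable
        tt = t @ t
        p = X.T @ t / tt         # X loadings
        c = (y @ t) / tt         # y loading
        X = X - np.outer(t, p)   # deflate X
        y = y - c * t            # deflate y
        W.append(w)
        P.append(p)
        q.append(c)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    return W @ np.linalg.solve(P.T @ W, q)


# Synthetic example: 30 samples, 3 variables, known coefficients plus noise
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=30)

coef = pls1_fit(X, y, n_components=3)
print(np.round(coef, 2))
```

With as many latent variables as there are (full-rank) predictors, the coefficients coincide with ordinary least squares; using fewer latent variables is what gives PLS its tolerance to multicollinearity.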
PLS/-DA: Utility
Strengths
Predict multiple dependent variables
Avoids issues of multicollinearity
Independent measure of variable importance

Weaknesses
Need to derive an empirical reference for model performance
Poorly established model optimization methods
PLS-DA: Example
Data: Old Faithful 2.0
272 observations on 8 variables
Select the appropriate number of Latent Variables (LVs) to maximize Q2
Latent Variables are analogous to PCs

Important Statistics (CV)
Q2 = fit
RMSEP = error of prediction
AU(RO)C = specificity vs. sensitivity
PLS-DA: Performance
Use permutation tests to empirically determine model performance
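A permutation test builds the empirical reference by re-scoring the model after shuffling the class labels many times. The sketch below uses plain Python; `mean_separation` is a hypothetical stand-in for a real model statistic such as Q2, and the values and labels are made up.

```python
import random
import statistics


def permutation_pvalue(score_fn, values, labels, n_permutations=999, seed=0):
    """Fraction of label shufflings that score at least as well as the real labels."""
    rng = random.Random(seed)
    observed = score_fn(values, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if score_fn(values, shuffled) >= observed:
            hits += 1
    # The +1 counts the observed labelling itself as one permutation
    return (hits + 1) / (n_permutations + 1)


def mean_separation(values, labels):
    """Stand-in performance score: absolute difference between class means."""
    a = [v for v, lab in zip(values, labels) if lab == 0]
    b = [v for v, lab in zip(values, labels) if lab == 1]
    return abs(statistics.mean(a) - statistics.mean(b))


# Hypothetical two-class data
values = [1.0, 1.2, 0.9, 1.1, 1.0, 3.0, 3.2, 2.9, 3.1, 3.0]
labels = [0] * 5 + [1] * 5

p = permutation_pvalue(mean_separation, values, labels)
print(f"permutation p-value = {p:.3f}")
```

A small p-value indicates that the observed performance is unlikely to arise from a random assignment of labels.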
PLS: Predictive Performance
Split data into training (2/3) and test sets (1/3)
Generate model using training set and then predict class assignment for test set
Use permutation tests to generate confidence bounds for future predictions
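The 2/3 training and 1/3 test split can be sketched as a random partition of sample indices. The function name `split_indices` and the fixed seed are assumptions for illustration; 272 is the number of observations in the Old Faithful data.

```python
import random


def split_indices(n_samples, test_fraction=1 / 3, seed=0):
    """Randomly partition sample indices into training and test sets."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_fraction)
    return sorted(indices[n_test:]), sorted(indices[:n_test])


train_idx, test_idx = split_indices(272)
print(len(train_idx), "training samples,", len(test_idx), "test samples")
```

The model is then fit on the rows in `train_idx` only, and its class predictions are compared with the known classes of the rows in `test_idx`.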
PLS: Feature Selection
Use the PLS-DA as an objective function to identify the most informative variables
Networks
Network: representation of relationships among objects

Utility
Project statistical results into a biological context
Explore informative data aspects in the context of all that was observed.
Identify emergent patterns
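One simple way to represent relationships among variables as a network is to connect pairs whose correlation exceeds a threshold. This is only one of several possible constructions and is not stated as the method used in this talk; the variable names, values and the 0.8 cutoff are all assumptions.

```python
import math
from itertools import combinations


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length vectors."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlation_network(data, threshold=0.8):
    """Edge list (node, node, r) for variable pairs with |r| >= threshold."""
    edges = []
    for a, b in combinations(sorted(data), 2):
        r = pearson_r(data[a], data[b])
        if abs(r) >= threshold:
            edges.append((a, b, round(r, 3)))
    return edges


# Hypothetical variables measured on five samples
data = {
    "A": [1.0, 2.0, 3.0, 4.0, 5.0],
    "B": [2.1, 3.9, 6.2, 7.8, 10.1],
    "C": [9.0, 7.0, 6.0, 3.0, 1.0],
    "D": [3.0, 1.0, 4.0, 1.0, 5.0],
}

for edge in correlation_network(data):
    print(edge)
```

The sign of r on each edge can be mapped to edge color and its magnitude to edge width when the network is drawn.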
Networks
Interpret statistical results
within a biological context
Networks
Highlight changes in patterns of relationships.

Figure panels: non-diabetics, type 2 diabetics


Networks
Display complex interactions

Figure panels: non-diabetics, type 2 diabetics


imDEV: interactive modules for Data Exploration and Visualization
An integrated environment for systems level analysis of
multivariate data.

Figure panels: non-diabetics, type 2 diabetics

http://sourceforge.net/apps/mediawiki/imdev
Acknowledgements

Newman Lab

Designated Emphasis in Biotechnology (DEB)

NIH

This project is funded in part by the NIH grant NIGMS-NIH T32-GM008799, USDA-ARS 5306-51530-019-00D, and NIH-NIDDK R01DK078328-01.
