Multivariate Data Analysis and Visualization Tools

Multivariate Data Analysis and Visualization
Tools for Understanding Biological Data

Dmitry Grapov
Introduction: Systems
[Figure labels: Emergent; Graph theory; Complex systems; Informatics; Modeling; Physiology; Biochemistry; Systems; Chemical analysis; Deterministic; Reductionist]
Oltvai, et al. Science 25 October 2002: 763-764.

Introduction: Inference
Overview
Types:            Univariate    Bivariate       Multivariate
                  1-D           2-D             n-D
Properties:       vector        matrix          matrix
Representations:  histograms,   scatter plots   dendrograms, heatmaps,
                  densities                     biplots, networks
Central Idea:     mean          correlation     many
http://www.thefullwiki.org/Hypercube
Univariate: Properties
vector of length m
mean
variance
Univariate: Representations
Univariate: Assumptions
Normality
Univariate: Utility
Hypothesis testing
α - type I error (false positive)
β - type II error (false negative)
power - (1 - β)
effect size - standardized difference in mean
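The link between effect size, sample size, and power can be sketched numerically. The following is a minimal illustration, assuming a two-sided, two-sample comparison and a normal approximation to the test statistic; the function names `cohens_d` and `approx_power` are hypothetical and do not come from the slides.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def cohens_d(a, b):
    """Standardized difference in means, using the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var)


def approx_power(d, n_per_group, alpha=0.05):
    """Approximate power (1 - beta) of a two-sided, two-sample test.

    Assumption: the test statistic is treated as normally distributed,
    which is reasonable for moderate to large groups.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)          # rejection threshold for alpha
    shift = abs(d) * sqrt(n_per_group / 2)     # expected statistic under the effect
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


# Power grows with sample size for a fixed effect size.
for n in (10, 30, 64, 100):
    print(n, round(approx_power(0.5, n), 2))
```

With no true effect (d = 0) the function returns alpha, which is the type I error rate by construction.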
Univariate: Limitations
Biological definition of the mean?
Relationship between sample size and test power
Multiple hypothesis testing
False discovery rate
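One common way to control the false discovery rate when many variables are tested is the Benjamini-Hochberg adjustment. The slides do not name a specific procedure, so the sketch below is an assumed choice, and `benjamini_hochberg` is a hypothetical helper name.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotonic.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values from four univariate tests.
p = [0.01, 0.04, 0.03, 0.005]
q = benjamini_hochberg(p)
significant = [qi <= 0.05 for qi in q]
```

Variables whose adjusted value falls at or below the chosen FDR level (0.05 here) are reported as discoveries.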
Old Faithful Data
272 observations
time between eruptions: 70 ± 14 min
duration of eruption: 3.5 ± 1 min
Azzalini, A. and Bowman, A. W. (1990). A look at some data on the Old Faithful geyser. Applied Statistics 39, 357-365.
Bivariate: Properties
Matrix of 2 vectors of length m
Bivariate: Representations
(X,Y)
Bivariate: Utility
bivariate distribution
correlation
(X,Y)
Variable 2 = m*Variable 1 + b
Bivariate: Limitations
The correlation coefficient measures only a linear or monotonic relationship
http://en.wikipedia.org/wiki/Correlation
Bivariate: Limitations
Sensitive to outliers
http://en.wikipedia.org/wiki/Correlation
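Both limitations can be seen on small made-up vectors. The sketch below assumes Pearson correlation for the linear case and Spearman rank correlation for the monotonic case; the data are hypothetical, not the Old Faithful measurements.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation: strength of a linear relationship."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks (monotonic)."""
    return pearson(ranks(x), ranks(y))


# Monotonic but non-linear: Spearman is 1, Pearson is below 1.
x = [1, 2, 3, 4, 5, 6]
y = [v ** 3 for v in x]

# A single outlier flips a perfectly negative Pearson correlation to positive.
x_out = [1, 2, 3, 4, 5, 20]
y_out = [5, 4, 3, 2, 1, 20]
r_without = pearson(x_out[:5], y_out[:5])
r_with = pearson(x_out, y_out)
rho_with = spearman(x_out, y_out)
```

The rank-based coefficient is less affected by the outlier, which is one reason to inspect the scatter plot before trusting a single coefficient.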
Old Faithful
Azzalini, A. and Bowman, A. W. (1990). A look at some data on the Old Faithful geyser. Applied Statistics 39, 357-365.
Old Unfaithful?
Additional variables: nearby hydrofracking
Improve inference based on more information
Multivariate: Properties
A matrix of n vectors of length m
Challenges
data often wide structured
integration
noise
Rewards
robust inference
signal amplification
holistic/systems approach
[Figure: Correlation matrix]
Multivariate: Dimensional Reduction
Principal Components Analysis (PCA)
Linear n-dimensional encoding of original data

Where dimensions are:
1. orthogonal (uncorrelated)
2. Top k dimensions are ordered by variance explained
PC 1
PC 2
Multivariate: Dimensional Reduction
Calculating PCs: singular value decomposition (SVD)
Original Data = Scores x Eigenvalues x Loadings (transposed)
Scores (n x PC): sample representation based on all variables
Eigenvalues (PC x PC): explained variance
Loadings (m x PC): variable contribution to scores
Wall, Michael E., Andreas Rechtsteiner, Luis M. Rocha. "Singular value decomposition and principal component analysis". In A Practical Approach to Microarray Data Analysis. D.P. Berrar, W. Dubitzky, M. Granzow, eds. pp. 91-109, Kluwer: Norwell, MA (2003). LANL LA-UR-02-4001.
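The decomposition above can be sketched with NumPy's `numpy.linalg.svd`. This is a minimal illustration, not the author's implementation: `pca_svd` is a hypothetical helper, and the data are synthetic (two correlated variables plus six noise variables), generated only to have the same shape as the example in the slides.

```python
import numpy as np


def pca_svd(X, scale=True):
    """PCA by SVD of the column-centered (optionally unit-variance) data."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)
    if scale:
        Xc = Xc / X.std(axis=0, ddof=1)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U * s                        # samples x PCs
    loadings = Vt.T                       # variables x PCs
    explained = s ** 2 / np.sum(s ** 2)   # fraction of variance per PC
    return scores, loadings, explained


# Synthetic stand-in data: 272 samples, 8 variables (assumption, not real data).
rng = np.random.default_rng(0)
signal = rng.normal(size=(272, 1))
X = np.hstack([signal + 0.3 * rng.normal(size=(272, 2)),
               rng.normal(size=(272, 6))])

scores, loadings, explained = pca_svd(X)
print(np.round(explained, 3))
```

The scores columns are mutually uncorrelated and the explained variance is sorted in decreasing order, matching the two properties listed for the PCA dimensions.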
Multivariate: Representations
Old Faithful 2.0
A matrix of n vectors of length m
272 measurements on 8 variables (2 real, 6 random noise)
Multivariate: Representation
Number of PCs can be used to estimate true data complexity
Identify outliers using all measurements
Identify interesting groups
Evaluate uni- and bivariate observations
Use known to impute missing
PCA: Considerations
data pre-treatment
outliers
noise
unsupervised projection
[Figure panels: no pre-treatment; centered and scaled to unit variance]
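The effect of pre-treatment on the projection can be shown on a toy case. The sketch below assumes two correlated variables measured on very different scales; `autoscale` and `first_pc_loadings` are hypothetical helper names and the data are simulated.

```python
import numpy as np


def autoscale(X):
    """Center each variable and scale it to unit variance."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)


def first_pc_loadings(X):
    """Loadings of PC 1 from the SVD of column-centered data."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[0]


# Simulated data: the same underlying signal in two very different units.
rng = np.random.default_rng(1)
base = rng.normal(size=200)
X = np.column_stack([base + 0.1 * rng.normal(size=200),
                     1000 * (base + 0.1 * rng.normal(size=200))])

raw = first_pc_loadings(X)                # dominated by the large-scale variable
scaled = first_pc_loadings(autoscale(X))  # both variables contribute equally
```

Without scaling, PC 1 mostly reflects whichever variable has the largest numeric variance; after autoscaling, it reflects the shared correlation structure.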
PCA: Considerations
data pre-treatment
outliers
linear reconstruction
noise
unsupervised projection
Independent components analysis (ICA)
Use ICA to calculate statistically independent components
PCA: Considerations
data pre-treatment
outliers
noise
supervised projection
Non-negative matrix factorization (NMF)
NMF uses additive parts-based encoding
Learning the parts of objects by nonnegative matrix factorization, D.D. Lee, H.S. Seung, Zhipeng Zhao, ppt.
PCA: Considerations
data pre-treatment
outliers
noise
supervised projection
Partial Least Squares Projection to Latent Structures (PLS/-DA)
Identify projection correlated with class assignment (classification) or continuous variables (regression)
PLS/-DA: Utility
Strengths
Predict multiple dependent variables
Avoids issues of multicollinearity
Independent measure of variable importance
Weaknesses
Need to derive an empirical reference for model performance
Poorly established model optimization methods
PLS-DA: Example
Data: Old Faithful 2.0
272 observations on 8 variables
Select the appropriate number of Latent Variables (LVs) to maximize Q2
Latent Variables are analogous to PCs
Important Statistics (CV)
Q2 = fit
RMSEP = error of prediction
AU(RO)C = specificity vs. sensitivity
PLS-DA: Performance
Use permutation tests to empirically determine model
performance
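A permutation test builds the empirical reference by recomputing the performance score after shuffling the class labels. The sketch below is generic: `mean_gap` is a deliberately simple stand-in score (an assumption, not a PLS-DA statistic such as Q2), and the data are simulated.

```python
import random


def permutation_p_value(score_fn, X, y, n_permutations=999, seed=0):
    """Compare the observed score against scores from label-permuted data."""
    rng = random.Random(seed)
    observed = score_fn(X, y)
    labels = list(y)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(labels)  # break any real link between X and the labels
        if score_fn(X, labels) >= observed:
            hits += 1
    # Add 1 to both counts so the p-value is never exactly zero.
    return observed, (hits + 1) / (n_permutations + 1)


def mean_gap(X, y):
    """Stand-in performance score: absolute gap between the two class means."""
    a = [x for x, label in zip(X, y) if label == 0]
    b = [x for x, label in zip(X, y) if label == 1]
    return abs(sum(a) / len(a) - sum(b) / len(b))


# Simulated two-class data with a real difference between classes.
data_rng = random.Random(42)
y = [0] * 30 + [1] * 30
X = [data_rng.gauss(0, 1) + 2.0 * label for label in y]

observed, p = permutation_p_value(mean_gap, X, y)
```

In practice `score_fn` would refit the model on each permuted label set and return its cross-validated statistic, so the null distribution reflects the whole modeling procedure.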
PLS: Predictive Performance
Split data into training (2/3) and test sets (1/3)
Generate model using training set and then predict class assignment for test set
Use permutation tests to generate confidence bounds for future predictions
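The split-then-predict workflow can be sketched as follows. A nearest-centroid classifier stands in for PLS-DA here (an assumption made to keep the example self-contained), and the two-class data are simulated; all function names are hypothetical.

```python
import random


def train_test_split(n, train_fraction=2 / 3, seed=0):
    """Shuffle sample indices and split them into training and test sets."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    cut = round(n * train_fraction)
    return idx[:cut], idx[cut:]


def fit_centroids(X, y):
    """Stand-in classifier: the per-class mean of each variable."""
    centroids = {}
    for label in set(y):
        rows = [x for x, lab in zip(X, y) if lab == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids


def predict(centroids, x):
    """Assign the class whose centroid is closest in squared distance."""
    return min(centroids,
               key=lambda label: sum((a - b) ** 2
                                     for a, b in zip(x, centroids[label])))


# Simulated data: 272 samples, one informative and one noise variable.
rng = random.Random(1)
y = [i % 2 for i in range(272)]
X = [[rng.gauss(3.0 * label, 1.0), rng.gauss(0.0, 1.0)] for label in y]

train, test = train_test_split(len(X))
model = fit_centroids([X[i] for i in train], [y[i] for i in train])
accuracy = sum(predict(model, X[i]) == y[i] for i in test) / len(test)
```

The test samples play no part in fitting, so the accuracy estimates how the model would perform on new observations.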
PLS: Predictive Performance
PLS: Feature Selection
Use the PLS-DA as an objective function to identify the most informative variables
Networks
Network: representation of relationships among objects
Utility
Project statistical results into a biological context
Explore informative data aspects in the context of all that was observed.
Identify emergent patterns
Networks
Interpret statistical results within a biological context
Networks
Highlight changes in patterns of relationships.
non-diabetics type 2 diabetics

Networks
Display complex interactions

imDEV: interactive modules for Data Exploration and Visualization
An integrated environment for systems level analysis of multivariate data.
http://sourceforge.net/apps/mediawiki/imdev
Acknowledgements
Newman Lab
Designated Emphasis in Biotechnology (DEB)
NIH
This project is funded in part by the NIH grant NIGMS-NIH T32-GM008799, USDA-ARS 5306-51530-019-00D, and NIH-NIDDK R01DK078328-01.
