
Texas A&M

GSPLab

Data Mining

Edward R. Dougherty
Department of Electrical and Computer Engineering
Center for Bioinformatics and Genomic Systems Engineering
Texas A&M University

gsp.tamu.edu


Reading
Book: Chapter 8
Paper: Dougherty, E. R., Prudence, Risk, and
Reproducibility in Biomarker Discovery,
BioEssays, Vol. 34, No. 4, 277-279, 2012.

03/28/15


Knowledge Discovery
Knowing the constitution of scientific knowledge
and how to validate it leaves open the question of
how to discover knowledge.
Obviously, we need to observe Nature, but in what
manner?


Bacon on Planned Experiments


Francis Bacon (Novum Organum, 1620):
There remains simple experience which, if
taken as it comes, is called accident; if sought
for, experiment. But this kind of experience
is ... a mere groping, as of men in the dark ...
But the true method of experience, on the
contrary, first lights the candle, and then by
means of the candle shows the way;
commencing as it does with experience duly
ordered and digested, not bungling or erratic,
and from it educing axioms, and from
established axioms again new experiments.


Experimental Design: The Path of Progress


Immanuel Kant (Critique of Pure Reason, 1781): It is
only when experiment is directed by rational principles
that it can have any real utility. Reason must approach
nature with the view, indeed, of receiving information
from it, not, however, in the character of a pupil, who
listens to all that his master chooses to tell him, but in
that of a judge, who compels the witnesses to reply to
those questions which he himself thinks fit to propose.
To this single idea must the revolution be ascribed, by
which, after groping in the dark for so many centuries,
natural science was at length conducted into the path of
certain progress.


Judicious Feature Selection


James Clerk Maxwell: The feature which
presents itself most forcibly to the untrained
inquirer may not be that which is considered
most fundamental by the experienced man of
science; for the success of any physical
investigation depends on the judicious
selection of what is to be observed as of
primary importance, combined with a
voluntary abstraction of the mind from those
features which, however attractive they
appear, we are not yet sufficiently advanced in
science to investigate with profit.


An Experiment is a Question
Hans Reichenbach (Rise of Scientific
Philosophy): An experiment is a question
addressed to Nature. As long as we
depend on the observation of occurrences
not involving our assistance, the
observable happenings are usually the
product of so many factors that we cannot
determine the contribution of each
individual factor to the total result.


Reasoning to Science
Hans Reichenbach: By means of the
artificial occurrences of planned
experiments, the complex occurrence of
Nature is thus analyzed into its
components. That Greek science did not
use experiments in any significant way
proves how difficult it was to turn from
reasoning to empirical science.
Science is not constituted by reasoning about
data; it is constituted by pragmatic, predictive
models.

Foolish Questions Yield Foolish Answers


Arturo Rosenblueth and Norbert
Wiener: An experiment is a question. A
precise answer is seldom obtained if the
question is not precise; indeed, foolish
answers (i.e., inconsistent, discrepant or
irrelevant experimental results) are
usually indicative of a foolish question.


Models Depend on Questions Asked


Werner Heisenberg: The most important new
result of nuclear physics was the recognition of
the possibility of applying quite different types
of natural laws, without contradiction, to one
and the same physical event. This is due to the
fact that within a system of laws which are
based on certain fundamental ideas only certain
quite definite ways of asking questions make
sense, and thus, that such a system is separated
from others which allow different questions to
be put.


Mere Observation
Hannah Arendt: [Natural science]
seemed to be liberated by the discovery that
our senses by themselves do not tell the
truth. Henceforth, sure of the unreliability
of sensation and the resulting insufficiency
of mere observation, the natural sciences
turned toward the experiment, which, by
directly interfering with nature, assured the
development whose progress has ever since
appeared to be limitless.


Answers Without Questions


Hannah Arendt: The experiment being a
question put before nature (Galileo), the
answers of science will always remain
replies to questions asked by men; the
confusion in the issue of objectivity was to
assume that there could be answers without
questions and results independent of a
question-asking being.


Efficient Experimentation
Douglas Montgomery: If an experiment is
to be performed most efficiently, then a
scientific approach to planning the
experiment must be considered. By the
statistical design of experiments we refer to
the process of planning the experiment so
that appropriate data will be collected, which
may be analyzed by statistical methods
resulting in valid and objective conclusions.
The statistical approach to experimental
design is necessary if we wish to draw
meaningful conclusions from the data.


Everyday Classification
Some algorithm is proposed.
The algorithm separates some data set.
We are not told the distribution from which the data come.

An estimation rule is used to estimate the error.


We are given no reason why the estimate should be good.
In fact, often we expect that the estimate is not good.

The estimate is small and the algorithm is claimed to be validated.
We are given no justification for the claim.
We are given no conditions under which it is valid.


Data Mining Definition 1


(Merriam-Webster) Type of database analysis that
attempts to discover useful patterns or relationships in
a group of data. The analysis uses advanced statistical
methods, such as cluster analysis, and sometimes
employs artificial intelligence or neural network
techniques. A major goal of data mining is to discover
previously unknown relationships among the data,
especially when the data come from different
databases.
Relations among data, not among variables: no science!


Data Mining Definition 2


(Wikipedia) Data analysis has increasingly been
augmented with indirect, automated data processing,
aided by other discoveries in computer science, such as
neural networks, cluster analysis, genetic algorithms, and
support vector machines. Data mining is the process of
applying these methods with the intention of uncovering
hidden patterns in large data sets.
Uncovering patterns in data sets: no science!


Data Mining
Data mining is a return to pre-Baconian groping, albeit at
a much faster groping rate than was then possible.
It suffers from three debilitating properties:
It does not ask precise questions.
There is no statistical characterization of the procedure.
As opposed to pattern recognition, it lacks a characterization of
prediction in the context of a distribution.

Sometimes it is justified by large-sample theory, typically absent a rigorous analysis of the problem at hand.

Statistics for the Proletariat


Julian L. Simon (Resampling: The New Statistics,
1997): Monte Carlo resampling simulation takes the
mumbo-jumbo out of statistics and enables even
beginning students to understand completely everything
that is done. Even many experts are unable to
understand intuitively the formal mathematical approach
to the subject. Clearly, we need a method free of the
formulas that bewilder almost everyone.
Everyday common sense should replace the mumbo-jumbo
of scientific rigor and, to a great extent, it has.


The Numbers Speak for Themselves


Chris Anderson (The End of Theory: The Data Deluge
Makes the Scientific Method Obsolete): The more we
learn about biology, the further we find ourselves from a
model that can explain it. There is now a better way.
Petabytes allow us to say: "Correlation is enough." We
can stop looking for models. We can analyze the data
without hypotheses about what it might show. We can
throw the numbers into the biggest computing clusters
the world has ever seen and let statistical algorithms find
patterns where science cannot. ... With enough data, the
numbers speak for themselves.


Consistency (Asymptotic Convergence)


For a sample $S$ of size $n$, there is a design cost: $\Delta_n = \varepsilon_n - \varepsilon_{\mathrm{Bayes}}$.
A classification rule is consistent if $E[\Delta_n] \to 0$ as $n \to \infty$.
An error estimator is consistent if the estimate converges
to the true error as $n \to \infty$.
What good is this for small samples?


Asymptotic Convergence is Irrelevant


Appeal to laws of large numbers or central limit
theorems in small-sample settings is unwarranted.
Training-data-based error estimation methods, such as
cross-validation and bootstrap, converge asymptotically
as the sample size goes to infinity, but this is of
virtually no value for small samples.
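
A small simulation can make this concrete. The sketch below is an illustration under assumed conditions that are not part of the lecture: a one-dimensional two-class Gaussian model (means at -1 and +1, unit variance, equal priors), a classifier that thresholds at the midpoint of the two sample means (LDA in one dimension), and leave-one-out cross-validation as the error estimator. All function names are hypothetical. Because the model is known, the true error of each designed classifier can be computed exactly and compared with its estimate.

```python
import math
import numpy as np

MU = 1.0  # assumed model: class 0 ~ N(-MU, 1), class 1 ~ N(+MU, 1), equal priors


def norm_cdf(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def design(x0, x1):
    """Threshold at the midpoint of the sample means.
    Returns (t, s); the classifier predicts class 1 when s * (x - t) > 0."""
    m0, m1 = x0.mean(), x1.mean()
    return 0.5 * (m0 + m1), (1.0 if m1 >= m0 else -1.0)


def true_error(t, s):
    """Exact error of the designed classifier under the assumed model."""
    err = 0.5 * ((1.0 - norm_cdf(t + MU)) + norm_cdf(t - MU))
    return err if s > 0 else 1.0 - err


def loo_estimate(x0, x1):
    """Leave-one-out cross-validation estimate of the error."""
    errors = 0
    for i in range(len(x0)):
        t, s = design(np.delete(x0, i), x1)
        errors += int(s * (x0[i] - t) > 0)    # class 0 point labeled as class 1
    for i in range(len(x1)):
        t, s = design(x0, np.delete(x1, i))
        errors += int(s * (x1[i] - t) <= 0)   # class 1 point labeled as class 0
    return errors / (len(x0) + len(x1))


def rms_deviation(n_per_class, reps, rng):
    """RMS of (estimated error - true error) over repeated samples."""
    dev = []
    for _ in range(reps):
        x0 = rng.normal(-MU, 1.0, n_per_class)
        x1 = rng.normal(+MU, 1.0, n_per_class)
        t, s = design(x0, x1)
        dev.append(loo_estimate(x0, x1) - true_error(t, s))
    return float(np.sqrt(np.mean(np.square(dev))))


rng = np.random.default_rng(0)
rms_small = rms_deviation(10, 300, rng)    # n = 20
rms_large = rms_deviation(500, 30, rng)    # n = 1000
print(f"n = 20:   RMS deviation = {rms_small:.3f}")
print(f"n = 1000: RMS deviation = {rms_large:.3f}")
```

Under these assumptions the deviation shrinks as n grows, which is all that consistency promises. It gives no guarantee about the size of the deviation at n = 20, which is the quantity that matters in a small-sample study.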


Asymptopia
Edward Leamer: Two of the latest products-to-end-all-suffering are nonparametric estimation
and consistent standard errors, which promise
results without assumptions, as if we were
already in Asymptopia where data are so plentiful
that no assumptions are needed. ... By
disguising the assumptions on which nonparametric
methods and consistent standard errors rely, the purveyors
of these methods have made it impossible to have an
intelligible conversation about the circumstances in which
their gimmicks do not work well and ought not to be used.
As for me, I prefer to carry parameters on my journey so I
know where I am and where I am going, not travel stoned
on the latest euphoria drug.


Tackling Small Sample Problems


Ronald A. Fisher (1925): Little experience
is sufficient to show that the traditional [large
sample] machinery of statistical processes is
wholly unsuited to the needs of practical
research. Not only does it take a cannon to
shoot a sparrow, but it misses the sparrow!...
The elaborate mechanism built on the theory of infinitely
large samples is not accurate enough for simple laboratory
data. Only by systematically tackling small sample problems
on their merits does it seem possible to apply accurate tests to
practical data.

Full Knowledge is in Sampling Distribution


Harald Cramér (1946): It is clear that a
knowledge of the exact form of a sampling
distribution would be of a far greater value
than the knowledge of a number of moment
characteristics or a limiting expression for
large values of n. Especially when we are
dealing with small samples, as is often the
case in the applications, the asymptotic
expressions are sometimes grossly inadequate,
and a knowledge of the exact form of the
distribution would then be highly desirable.


Humean Trap: Data Without Reason


Hume (Treatise of Human Nature): The
mind is a kind of theatre, where several
perceptions successively make their
appearance; pass, repass, glide away, and
mingle in an infinite variety of postures
and situations. There is properly no
simplicity in it at one time, nor identity in
different [times].
A definition of radical empiricism.
Data mining.

Necessity of an Intelligent Idea


William Barrett (Illusion of
Technique): The absence of an
intelligent idea in the grasp of a
problem cannot be redeemed by the
elaborateness of the machinery one
subsequently employs.


The Imprint of Mind


William Barrett (Illusion of Technique): The scientist's
mind is not a passive mirror that reflects the facts as they
are in themselves (whatever that might mean); the
scientist constructs models, which are not found among
the things given him in his experience, and proceeds to
impose those models upon Nature. And he must often
construct those models conceptually before they are
translated at any point into the material constructions of
his apparatus in the laboratory. The imprint of mind is
everywhere on the body of this science, and without the
founding power of mind it would not exist.


Radical Empiricism Denies Knowledge


Hans Reichenbach (Rise of Scientific
Philosophy): A mere report of relations
observed in the past cannot be called
knowledge. If knowledge is to reveal
objective relations of physical objects, it
must include reliable predictions. A radical
empiricism, therefore, denies the
possibility of knowledge.
A collection of measurements, together with
statements about the measurements, is not
scientific knowledge.

A Huge Challenge
Janet Woodcock (Director, Center for Drug
Evaluation and Research, FDA): [As much as
75 percent of published biomarker associations
are not replicable] This poses a huge
challenge for industry in biomarker
identification and diagnostics development.
Dougherty, E. R., Prudence, Risk, and Reproducibility in
Biomarker Discovery, BioEssays, 34(4), 277-279, 2012.
Yousefi, M., and E. R. Dougherty, Performance Reproducibility
Index for Classification, Bioinformatics, 28(21), 2824-2833,
2012.

Reporting Bias When Using Real Data


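$m$ data sets of size $n$, LDA with 10-fold cross-validation.
$\varepsilon_{\mathrm{est}}^{(0)}$ and $\varepsilon_{\mathrm{true}}^{(0)}$ are the estimated and true errors for the
sample with the lowest error estimate, and $E[\varepsilon_{\mathrm{true}}]$ is the
expected true error over all samples.
Left: $\varepsilon_{\mathrm{est}}^{(0)} - \varepsilon_{\mathrm{true}}^{(0)}$; right: $\varepsilon_{\mathrm{est}}^{(0)} - E[\varepsilon_{\mathrm{true}}]$; $n = 60, 120$.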
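
The selection effect described on this slide can be reproduced in a toy setting. The sketch below is an assumption-laden illustration, not the study's setup: instead of real data sets it draws m independent samples from a one-dimensional Gaussian model (class means at -0.5 and +0.5, unit variance), uses a midpoint-of-sample-means classifier in place of multivariate LDA, and estimates the error by 10-fold cross-validation. Function names and parameter values are hypothetical. It reports the average gap between the lowest estimate and the true error of the same sample, and between the lowest estimate and the mean true error over all samples.

```python
import math
import numpy as np

MU = 0.5  # assumed model: class 0 ~ N(-MU, 1), class 1 ~ N(+MU, 1), equal priors


def norm_cdf(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def design(x, y):
    """Threshold at the midpoint of the class sample means.
    The classifier predicts class 1 when s * (x - t) > 0."""
    m0, m1 = x[y == 0].mean(), x[y == 1].mean()
    return 0.5 * (m0 + m1), (1.0 if m1 >= m0 else -1.0)


def true_error(t, s):
    """Exact error of the designed classifier under the assumed model."""
    err = 0.5 * ((1.0 - norm_cdf(t + MU)) + norm_cdf(t - MU))
    return err if s > 0 else 1.0 - err


def cv_estimate(x, y, k, rng):
    """k-fold cross-validation estimate with a random fold assignment."""
    idx = rng.permutation(len(x))
    wrong = 0
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)
        t, s = design(x[train], y[train])
        pred = (s * (x[fold] - t) > 0).astype(int)
        wrong += int(np.sum(pred != y[fold]))
    return wrong / len(x)


def reporting_bias(m, n, k, reps, rng):
    """Average of est_min - true_min and of est_min - mean(true) over repetitions."""
    y = np.repeat([0, 1], n // 2)
    gap_selected, est_min, all_true = [], [], []
    for _ in range(reps):
        est, tru = [], []
        for _ in range(m):
            x = rng.normal((2 * y - 1) * MU, 1.0)   # one simulated data set
            t, s = design(x, y)
            est.append(cv_estimate(x, y, k, rng))
            tru.append(true_error(t, s))
        j = int(np.argmin(est))                     # data set that would be reported
        gap_selected.append(est[j] - tru[j])
        est_min.append(est[j])
        all_true.extend(tru)
    return float(np.mean(gap_selected)), float(np.mean(est_min) - np.mean(all_true))


rng = np.random.default_rng(1)
gap_selected, gap_expected = reporting_bias(m=10, n=60, k=10, reps=200, rng=rng)
print(f"mean(est_min - true_min) = {gap_selected:+.3f}")
print(f"mean(est_min) - E[true]  = {gap_expected:+.3f}")
```

Both gaps come out negative in this toy model: reporting only the data set with the smallest estimate yields an optimistic figure even though each individual cross-validation estimate is close to unbiased.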

Yousefi, M. R., Hua, J., Sima, C., and E. R. Dougherty, Reporting Bias When Using Real
Data Sets to Analyze Classification Performance, Bioinformatics, 26 (1), 68-76, 2010.


Multiple-Rule Bias
Use $r$ classification rules and $s$ error
estimation rules. Select the pair with
the minimum estimated error, $\varepsilon_{\min,\mathrm{est}}$.
$\mathrm{Bias}(m) = E[\varepsilon_{\min,\mathrm{est}} - \varepsilon_{\mathrm{true}}(i_{\min})]$, over
sampling distribution, $m = rs$, $n = 60$.

Yousefi, M. R., Hua, J., and E. R. Dougherty, Multiple-Rule Bias in the Comparison of Classification Rules,
Bioinformatics, 27(12), 1675-1683, 2011.


Reproducibility Performance Index


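A preliminary study of size $n$ is reproducible with
accuracy $\rho \ge 0$ if $\varepsilon_n \le \varepsilon_n^{\mathrm{est}} + \rho$.
A follow-on study will be performed if $\varepsilon_n^{\mathrm{est}} \le \tau$.
$R_n(\rho, \tau) = P(\varepsilon_n \le \varepsilon_n^{\mathrm{est}} + \rho \mid \varepsilon_n^{\mathrm{est}} \le \tau)$.
Real data sets: LDA, $n = 60$, 5 features by t-test.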

Yousefi, M., and E. R. Dougherty, Performance Reproducibility Index for
Classification, Bioinformatics, 28(21), 2824-2833, 2012.


Reproducibility with Reporting Bias


Reproducibility index for $m = 5$ data sets, LDA, 5-fold cross-validation (5F-CV),
5 features, Gaussian with equal covariance matrices,
uncorrelated features.
(a) $n = 60$, $\rho = 0.0005$; (b) $n = 60$, $\rho = 0.05$;
(c) $n = 120$, $\rho = 0.0005$; (d) $n = 120$, $\rho = 0.05$.


Separate Sampling: Classifier Error


Class sizes, $n_0$ and $n_1$, pre-determined.
Hence, no estimate for class prior
probability $c = P(Y = 0)$.
Random sampling: $r = n_0/n \to c$ as $n \to \infty$ (in probability).

Fix $c$, expected error for $r$ (QDA):

Dark blue (c = 0.3), black (c = 0.4), light
blue (c = 0.5), red (c = 0.6), green (c = 0.7).
$r^*$ (crossing point) is minimax value.
Top: equal covariance; bottom: unequal
covariance.
Esfahani, M. S., and E. R. Dougherty, Effect of
Separate Sampling on Classification Accuracy,
Bioinformatics,


Separate Sampling: Error Estimation


Class sizes, n0 and n1, pre-determined.
Apply classical 5-fold cross-validation on
the data set to estimate the error (dashed
lines).
Apply separate-sampling 5-fold cross-validation (solid lines).

Fix c, Bias for r (L-SVM).


Dark blue (c = 0.3), black (c = 0.4), light blue
(c = 0.5), red (c = 0.6), green (c = 0.7).
Top: n = 80; bottom: n = 1000.
Braga-Neto, U. M., Zollanvari, A., and E. R.
Dougherty, Cross-Validation Under Separate
Sampling: Optimistic Bias and How to Correct It,


Apparent Patterns in Microarray Data


Relationship?

[Figure: microarray data matrix with axes labeled "genes" and "time course or experiments"; apparent "patterns" are marked.]


What Does This Mean?


Data are clustered
by some clustering
algorithm.
Is there scientific
knowledge here?


Clustering Algorithm
An algorithm that partitions a set of points into several
groups, based on a measure of similarity (or
dissimilarity) between the points.
Example: [Figure: points in a space with axes x1, x2, x3, partitioned into Group 1, Group 2, and Group 3.]


Expression Profile Clustering


Cluster expression vectors: clusters indicate
potential co-regulation in time-course data analysis.
Cluster samples: clusters indicate potential similar
sources, a sort of classification.
Methods:

Fuzzy c-means
K-means
Self-organizing map (SOM)
Hierarchical clustering (Euclidean distance)
Hierarchical clustering (correlation)


K-means Clustering
Goal: Partition points into tight clusters.
Algorithm:
Randomly initialize with k means $m_1, \ldots, m_k$
Place $x$ into $C_i$ if $\|x - m_i\| \le \|x - m_j\|$ for $j = 1, \ldots, k$
Update $m_1, \ldots, m_k$ as the means of $C_1, \ldots, C_k$
Repeat until means do not change
Clusters determined by Voronoi diagram of $m_1, \ldots, m_k$
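
The steps listed above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the initialization (k distinct data points chosen at random), the handling of empty clusters, the iteration cap, and the names `assign` and `kmeans` are assumptions made for the example.

```python
import numpy as np


def assign(points, means):
    """Label each point with the index of its nearest mean (its Voronoi cell)."""
    dists = np.linalg.norm(points[:, None, :] - means[None, :, :], axis=2)
    return dists.argmin(axis=1)


def kmeans(points, k, rng, max_iter=100):
    """Alternate assignment and mean updates until the means stop changing."""
    # Initialization: k distinct data points chosen at random (an assumption).
    means = points[rng.choice(len(points), size=k, replace=False)].astype(float)
    for _ in range(max_iter):
        labels = assign(points, means)
        # Update each mean; an empty cluster keeps its previous mean.
        new_means = np.array([
            points[labels == i].mean(axis=0) if np.any(labels == i) else means[i]
            for i in range(k)
        ])
        if np.allclose(new_means, means):
            break
        means = new_means
    return assign(points, means), means


# Usage: two well-separated synthetic blobs in the plane.
rng = np.random.default_rng(0)
data = np.vstack([rng.normal(0.0, 0.3, (50, 2)), rng.normal(5.0, 0.3, (50, 2))])
labels, means = kmeans(data, 2, rng)
print(means)
```

The result depends on the random initialization, and the procedure returns k clusters whether or not the data contain k groups.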


Hierarchical Clustering
Iteratively join clusters based on similarity measure
(agglomerative clustering).
Farthest neighbor similarity measure:
$d(C_i, C_j) = \max\{\|x - y\| : x \in C_i,\ y \in C_j\}$

Algorithm (complete linkage clustering):

Initialize clusters by $C_i = \{x_i\}$ for $i = 1, \ldots, n$
Iteratively merge the clusters for which the greatest distance
between points in the two clusters is minimized
Halts when the similarity measure exceeds a pre-defined
threshold
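
A direct, unoptimized sketch of the complete-linkage procedure follows. It recomputes every cluster-to-cluster distance at each merge, so it is only suitable for small point sets; the function name `complete_linkage`, the threshold value, and the example points are assumptions for illustration.

```python
import numpy as np


def complete_linkage(points, threshold):
    """Agglomerative clustering with the farthest-neighbor distance."""
    clusters = [[i] for i in range(len(points))]   # start with singletons
    # Pairwise Euclidean distances between all points.
    pair = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)

    def d(a, b):
        # Greatest distance between a point of cluster a and a point of cluster b.
        return pair[np.ix_(a, b)].max()

    while len(clusters) > 1:
        # Find the pair of clusters whose farthest-neighbor distance is smallest.
        dist, i, j = min(
            ((d(clusters[i], clusters[j]), i, j)
             for i in range(len(clusters))
             for j in range(i + 1, len(clusters))),
            key=lambda item: item[0],
        )
        if dist > threshold:      # halt: the closest pair is already too far apart
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Usage: six points on a line, given as a column of 1-D coordinates.
pts = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [9.0]])
clusters = complete_linkage(pts, threshold=1.0)
print(clusters)
```

The number of clusters returned is controlled entirely by the threshold, which is a user choice rather than something inferred from the data.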


Hierarchical Clustering Example

A. cholesterol biosynthesis

B. cell cycle
C. immediate-early response
D. signaling and angiogenesis
E. wound healing and tissue remodeling
Source: Michael B. Eisen, et al., PNAS 1998, Vol. 95


The Clustering Problem


Jain et al.: Clustering is a subjective process; the
same set of data items often needs to be partitioned
differently for different applications.
Jain, A. K., Murty, M. N., and P. J. Flynn, Data Clustering: A Review,
ACM Computing Surveys, 31 (3), 264-323, 1999.

Solution
Mathematical theory
Pattern recognition theory and random set theory


What Are Good Clusters?


Example:
- 2 or 3 clusters?
- What is the best separation?


Naive Clustering Error


Generate set of points from different distributions: $A_1, \ldots, A_k$.
Use clustering algorithm to form clusters: $C_1, \ldots, C_k$.
Align point sets and clusters, and count errors.
Average over a number of randomly generated sets.
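
The alignment-and-count step can be sketched as follows. Cluster labels are arbitrary, so the count is minimized over all relabelings of the clusters. This is a minimal illustration with a hypothetical function name; trying all k! permutations is only practical for small k, and the generating and averaging steps are left out.

```python
from itertools import permutations

import numpy as np


def clustering_error(true_labels, cluster_labels, k):
    """Fraction of misassigned points under the best matching of clusters to sources."""
    true_labels = np.asarray(true_labels)
    cluster_labels = np.asarray(cluster_labels)
    best = 1.0
    for perm in permutations(range(k)):
        relabeled = np.array(perm)[cluster_labels]   # rename cluster c to perm[c]
        best = min(best, float(np.mean(relabeled != true_labels)))
    return best


# Usage: nine points from three sources; one point ends up in the wrong cluster.
truth = [0, 0, 0, 1, 1, 1, 2, 2, 2]
found = [2, 2, 2, 0, 0, 1, 1, 1, 1]
err = clustering_error(truth, found, 3)
print(err)
```

Averaging this quantity over many generated point sets gives the naive error estimate described on the slide.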

Dougherty, E. R., Barrera, J., Brun, M., Kim, S., Cesar, R. M., Chen, Y.,
Bittner, M. L., and J. M. Trent, "Inference From Clustering with Application to
Gene-Expression Microarrays," Journal of Computational Biology, 9 (1), 105-126, 2002.


Synthetic Example
5 synthetic templates.
Simulated data from the templates, with different variances.
5 different clustering methods.


Single Experiment ($\sigma^2 = 0.25$)

No error!
Tighter clusters due to small variance.
Results from fuzzy c-means.


Experiment ($\sigma^2 = 3.0$)
Many misclassifications: clusters start mixing.
22 misclassifications (8.8%).


Hierarchical Clustering Error!!!

Before clustering

After clustering with a NICE dendrogram

24.5% Error!!

Algorithm: Hierarchical clustering with correlation measure


Clustering Error
Points are a realization S of a labeled random point
process.
Clustering algorithm assigns to $S$ a label function $\phi_S$.
The error of the algorithm is the expected difference between its
labels and the labels generated by the point process.
Error must take into account that we do not care about
the ordering, only the partitions generated.
Expectation taken with respect to the distribution of the
point process.


Example of Clustering Error


Left: Realization of point process
Right: Output of hierarchical clustering
Error: 40%


Clustering Validity
Clustering validity is analogous to classification
validity.
Replace classifier with cluster operator and
classification error with clustering error.


Validation Indices
Validation indices are meant to judge the validity of a
clustering output.
They can be based on a number of heuristic
considerations and methodologies.
Do they correspond to scientific validity?
Does a validation index correlate to clustering error?
Brun, M., Sima, C., Hua, J., Lowey, J., Carroll, B., Suh, E., and E. R.
Dougherty, Model-Based Evaluation of Clustering Validation Measures,
Pattern Recognition, 40 (3), 807-824, 2007.


Kendall's Correlation for Indices

Top: Realization of point process
Bottom: Kendall's correlation:
Dunn's index, D correl,
silhouette, figure of merit


Kendall's Correlation for Indices

Top: Realization of point process
Bottom: Kendall's correlation:
Dunn's index, D correl,
silhouette, figure of merit


Scientific Knowledge
Requires a mathematical model.
In classification, the model is learned from training data.

Requires a methodology to test the model.


Can inferences be made from the model?


Classification and Knowledge


The model is composed of a classifier (decision
function) and an error: a data point is observed and
it is assigned to a class.
The model is inferred from data by classification and
error-estimation rules.
Model validity is determined by properties of the
error-estimation rule.


Probabilistic Theory of Clustering


Clustering theory in the context of random sets.
Probabilistic error measure based on points being
clustered correctly.
Bayes clusterer (optimal clustering algorithm).
Learning theory for clustering algorithms.

Dougherty, E. R., and M. Brun, A Probabilistic Theory of Clustering, Pattern
Recognition, 37 (5), 917-925, 2004.


Data Mining Violates Basic Principles


Data mining violates two basic principles of
experimental design: (1) constrain the variables so
that the experiment is only minimally affected by
external conditions and the results elucidate clear
mathematically describable behavior; and (2) all
modeling is done within a rigorous statistical setting
in which both constraints and the sampling
distribution are clearly expressed.


What Data Mining Has Produced


Absent a sound epistemology there is no ground of
knowledge and therefore no knowledge.
There are thousands of papers in the literature for
which there is no demonstration of any meaning at all.
This has several serious consequences:

Huge waste of resources.


Literature is untrustworthy and much of it is useless.
Propagation of meaningless results on meaningless results.
Lack of progress on consequential problems.


Is Data Mining a Serious Scientific Endeavor?


Dougherty and Bittner (Epistemology of the Cell):
Does anyone really believe that data mining could
produce the general theory of relativity?