GSPLab

Data Mining

Edward R. Dougherty

Department of Electrical and Computer Engineering

Center for Bioinformatics and Genomic Systems Engineering

Texas A&M University

Texas A&M

Reading

Book: Chapter 8

Paper: Dougherty, E. R., Prudence, Risk, and

Reproducibility in Biomarker Discovery,

BioEssays, Vol. 34, No. 4, 277-279, 2012.

03/28/15

Knowledge Discovery

Knowing the constitution of scientific knowledge

and how to validate it leaves open the question of

how to discover knowledge.

Obviously, we need to observe Nature, but in what

manner?

Francis Bacon (Novum Organum, 1620):

There remains simple experience which, if

taken as it comes, is called accident; if sought

for, experiment. But this kind of experience

is ... a mere groping, as of men in the dark ...

But the true method of experience, on the

contrary, first lights the candle, and then by

means of the candle shows the way;

commencing as it does with experience duly

ordered and digested, not bungling or erratic,

and from it educing axioms, and from

established axioms again new experiments.

Immanuel Kant (Critique of Pure Reason, 1781): It is

only when experiment is directed by rational principles

that it can have any real utility. Reason must approach

nature with the view, indeed, of receiving information

from it, not, however, in the character of a pupil, who

listens to all that his master chooses to tell him, but in

that of a judge, who compels the witnesses to reply to

those questions which he himself thinks fit to propose.

To this single idea must the revolution be ascribed, by

which, after groping in the dark for so many centuries,

natural science was at length conducted into the path of

certain progress.

James Clerk Maxwell: The feature which

presents itself most forcibly to the untrained

inquirer may not be that which is considered

most fundamental by the experienced man of

science; for the success of any physical

investigation depends on the judicious

selection of what is to be observed as of

primary importance, combined with a

voluntary abstraction of the mind from those

features which, however attractive they

appear, we are not yet sufficiently advanced in

science to investigate with profit.

An Experiment is a Question

Hans Reichenbach (Rise of Scientific

Philosophy): An experiment is a question

addressed to Nature. As long as we

depend on the observation of occurrences

not involving our assistance, the

observable happenings are usually the

product of so many factors that we cannot

determine the contribution of each

individual factor to the total result.

Reasoning to Science

Hans Reichenbach: By means of the

artificial occurrences of planned

experiments, the complex occurrence of

Nature is thus analyzed into its

components. That Greek science did not

use experiments in any significant way

proves how difficult it was to turn from

reasoning to empirical science.

Science is not constituted by reasoning about

data; it is constituted by pragmatic, predictive

models.

Arturo Rosenblueth and Norbert

Wiener: An experiment is a question. A

precise answer is seldom obtained if the

question is not precise; indeed, foolish

answers, i.e., inconsistent, discrepant or

irrelevant experimental results, are

usually indicative of a foolish question.

Werner Heisenberg: The most important new

result of nuclear physics was the recognition of

the possibility of applying quite different types

of natural laws, without contradiction, to one

and the same physical event. This is due to the

fact that within a system of laws which are

based on certain fundamental ideas only certain

quite definite ways of asking questions make

sense, and thus, that such a system is separated

from others which allow different questions to

be put.

Mere Observation

Hannah Arendt: [Natural science]

seemed to be liberated by the discovery that

our senses by themselves do not tell the

truth. Henceforth, sure of the unreliability

of sensation and the resulting insufficiency

of mere observation, the natural sciences

turned toward the experiment, which, by

directly interfering with nature, assured the

development whose progress has ever since

appeared to be limitless.

Hannah Arendt: The experiment being a

question put before nature (Galileo), the

answers of science will always remain

replies to questions asked by men; the

confusion in the issue of objectivity was to

assume that there could be answers without

questions and results independent of a

question-asking being.

Efficient Experimentation

Douglas Montgomery: If an experiment is

to be performed most efficiently, then a

scientific approach to planning the

experiment must be considered. By the

statistical design of experiments we refer to

the process of planning the experiment so

that appropriate data will be collected, which

may be analyzed by statistical methods

resulting in valid and objective conclusions.

The statistical approach to experimental

design is necessary if we wish to draw

meaningful conclusions from the data.

Everyday Classification

Some algorithm is proposed.

The algorithm separates some data set.

We are not told the distribution from which the data come.

We are given no reason why the estimate should be good.

In fact, often we expect that the estimate is not good.

be validated.

We are given no justification for the claim.

We are given no conditions under which it is valid.

(Merriam-Webster) Type of database analysis that

attempts to discover useful patterns or relationships in

a group of data. The analysis uses advanced statistical

methods, such as cluster analysis, and sometimes

employs artificial intelligence or neural network

techniques. A major goal of data mining is to discover

previously unknown relationships among the data,

especially when the data come from different

databases.

Relations among data, not among variables: no science!

(Wikipedia) Data analysis has increasingly been

augmented with indirect, automated data processing,

aided by other discoveries in computer science, such as

neural networks, cluster analysis, genetic algorithms, and

support vector machines. Data mining is the process of

applying these methods with the intention of uncovering

hidden patterns in large data sets.

Uncovering patterns in data sets: no science!

Data Mining

Data mining is a return to pre-Baconian groping, albeit at

a much faster rate than was then possible.

It suffers from three debilitating properties:

It does not ask precise questions.

There is no statistical characterization of the procedure.

As opposed to pattern recognition, it lacks a characterization of

prediction in the context of a distribution.

typically absent a rigorous analysis to the problem at hand.

Julian L. Simon (Resampling: The New Statistics,

1997): Monte Carlo resampling simulation takes the

mumbo-jumbo out of statistics and enables even

beginning students to understand completely everything

that is done. Even many experts are unable to

understand intuitively the formal mathematical approach

to the subject. Clearly, we need a method free of the

formulas that bewilder almost everyone.

Everyday common sense should replace the mumbo-jumbo

of scientific rigor and, to a great extent, it has.

Chris Anderson (The End of Theory: The Data Deluge

Makes the Scientific Method Obsolete): The more we

learn about biology, the further we find ourselves from a

model that can explain it. There is now a better way.

Petabytes allow us to say: "Correlation is enough." We

can stop looking for models. We can analyze the data

without hypotheses about what it might show. We can

throw the numbers into the biggest computing clusters

the world has ever seen and let statistical algorithms find

patterns where science cannot. ... With enough data, the

numbers speak for themselves.

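For a sample S of size n, there is a design cost: $\Delta_n = \varepsilon_n - \varepsilon_{\text{Bayes}}$,

where $\varepsilon_n$ is the error of the designed classifier and $\varepsilon_{\text{Bayes}}$ is the Bayes error.

A classification rule is consistent if $E[\Delta_n] \to 0$ as $n \to \infty$.

An error estimator is consistent if the estimate converges

to the true error as $n \to \infty$.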

What good is this for small samples?

Appeal to laws of large numbers or central limit

theorems in small-sample settings is unwarranted.

Training-data-based error estimation methods, such as

cross-validation and bootstrap, converge asymptotically

as the sample size goes to infinity, but this is of

virtually no value for small samples.

Asymptopia

Edward Leamer: Two of the latest products-to-end-all-suffering are nonparametric estimation

and consistent standard errors, which promise

results without assumptions, as if we were

already in Asymptopia where data are so plentiful

that no assumptions are needed. ... By

disguising the assumptions on which nonparametric

methods and consistent standard errors rely, the purveyors

of these methods have made it impossible to have an

intelligible conversation about the circumstances in which

their gimmicks do not work well and ought not to be used.

As for me, I prefer to carry parameters on my journey so I

know where I am and where I am going, not travel stoned

on the latest euphoria drug.

Ronald A. Fisher (1925): Little experience

is sufficient to show that the traditional [large

sample] machinery of statistical processes is

wholly unsuited to the needs of practical

research. Not only does it take a cannon to

shoot a sparrow, but it misses the sparrow!...

The elaborate mechanism built on the theory of infinitely

large samples is not accurate enough for simple laboratory

data. Only by systematically tackling small sample problems

on their merits does it seem possible to apply accurate tests to

practical data.

Harald Cramér (1946): It is clear that a

knowledge of the exact form of a sampling

distribution would be of a far greater value

than the knowledge of a number of moment

characteristics or a limiting expression for

large values of n. Especially when we are

dealing with small samples, as is often the

case in the applications, the asymptotic

expressions are sometimes grossly inadequate,

and a knowledge of the exact form of the

distribution would then be highly desirable.

Hume (Treatise of Human Nature): The

mind is a kind of theatre, where several

perceptions successively make their

appearance; pass, repass, glide away, and

mingle in an infinite variety of postures

and situations. There is properly no

simplicity in it at one time, nor identity in

different [times].

A definition of radical empiricism.

Data mining.

William Barrett (Illusion of

Technique): The absence of an

intelligent idea in the grasp of a

problem cannot be redeemed by the

elaborateness of the machinery one

subsequently employs.

William Barrett (Illusion of Technique): The scientist's

mind is not a passive mirror that reflects the facts as they

are in themselves (whatever that might mean); the

scientist constructs models, which are not found among

the things given him in his experience, and proceeds to

impose those models upon Nature. And he must often

construct those models conceptually before they are

translated at any point into the material constructions of

his apparatus in the laboratory. The imprint of mind is

everywhere on the body of this science, and without the

founding power of mind it would not exist.

Hans Reichenbach (Rise of Scientific

Philosophy): A mere report of relations

observed in the past cannot be called

knowledge. If knowledge is to reveal

objective relations of physical objects, it

must include reliable predictions. A radical

empiricism, therefore, denies the

possibility of knowledge.

A collection of measurements, together with

statements about the measurements, is not

scientific knowledge.

A Huge Challenge

Janet Woodcock (Director, Center for Drug

Evaluation and Research, FDA): [As much as

75 percent of published biomarker associations

are not replicable] This poses a huge

challenge for industry in biomarker

identification and diagnostics development.

Dougherty, E. R., Prudence, Risk, and Reproducibility in

Biomarker Discovery, BioEssays, 34(4), 277-279, 2012.

Yousefi, M., and E. R. Dougherty, Performance Reproducibility

Index for Classification, Bioinformatics, 28(21), 2824-2833,

2012.

m data sets of size n, LDA with 10-fold cross-validation

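$\varepsilon_{\text{est}}(0)$ and $\varepsilon_{\text{true}}(0)$ are the estimated and true errors for the

sample with the lowest error estimate, and $E[\varepsilon_{\text{true}}]$ is the

expected true error over all samples.

Left: $\varepsilon_{\text{est}}(0) - \varepsilon_{\text{true}}(0)$; right: $\varepsilon_{\text{est}}(0) - E[\varepsilon_{\text{true}}]$; n = 60, 120.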

Yousefi, M. R., Hua, J., Sima, C., and E. R. Dougherty, Reporting Bias When Using Real

Data Sets to Analyze Classification Performance, Bioinformatics, 26 (1),

68-76, 2010.

Multiple-Rule Bias

Use r classification rules and s error

estimation rules. Select the pair with

the minimum estimated error, $\varepsilon_{\min,\text{est}}$.

Bias(m) = $E[\varepsilon_{\min,\text{est}} - \varepsilon_{\text{true}}(i_{\min})]$, over the

sampling distribution, m = rs, n = 60.

Yousefi, M. R., Hua, J., and E. R. Dougherty, Multiple-Rule Bias in the Comparison of Classification Rules,

Bioinformatics, 27(12), 1675-1683, 2011.
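
The optimism that comes from selecting the minimum estimate can be illustrated with a toy Monte Carlo sketch. This is not the experiment from the cited paper. It assumes that all m rule pairs share the same true error and that each estimate equals the true error plus independent zero-mean Gaussian noise; the function name, `true_error`, and `noise_sd` are hypothetical choices made only for illustration.

```python
import random

def min_selection_bias(m, true_error=0.2, noise_sd=0.05, trials=20000, seed=0):
    """Monte Carlo estimate of E[min_i est_i - true_error].

    Toy assumption: each of the m estimates is unbiased, est_i = true_error + noise.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        estimates = [true_error + rng.gauss(0.0, noise_sd) for _ in range(m)]
        # Report the smallest estimate, as in minimum-error selection.
        total += min(estimates) - true_error
    return total / trials

if __name__ == "__main__":
    for m in (1, 5, 20):
        print(m, round(min_selection_bias(m), 4))
```

Under these assumptions each individual estimate is unbiased, yet the reported minimum becomes more optimistic as m grows.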

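A preliminary study of size n is reproducible with

accuracy $\rho \ge 0$ if $\varepsilon_n \le \varepsilon_n^{\text{est}} + \rho$.

A follow-on study will be performed if $\varepsilon_n^{\text{est}} \le \tau$.

$R_n(\rho, \tau) = P(\varepsilon_n \le \varepsilon_n^{\text{est}} + \rho \mid \varepsilon_n^{\text{est}} \le \tau)$.

Real data sets: LDA, n = 60, 5 features by t-test.

Yousefi, M., and E. R. Dougherty, Performance Reproducibility Index for

Classification, Bioinformatics, 28(21), 2824-2833, 2012.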
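
Given paired true and estimated errors from a collection of simulated preliminary studies, the conditional probability that defines the reproducibility index can be approximated by a frequency count. The following is a minimal sketch under that assumption. The function name and the sample numbers are illustrative and do not come from the cited paper; `rho` stands for the accuracy tolerance and `tau` for the threshold on the estimate that triggers a follow-on study.

```python
def reproducibility_index(true_errors, est_errors, rho, tau):
    """Empirical P(true <= est + rho | est <= tau).

    Returns None when no preliminary study has est <= tau,
    because the conditional probability is then undefined.
    """
    followed = [(t, e) for t, e in zip(true_errors, est_errors) if e <= tau]
    if not followed:
        return None
    hits = sum(1 for t, e in followed if t <= e + rho)
    return hits / len(followed)

if __name__ == "__main__":
    # Hypothetical (true error, estimated error) pairs for four small studies.
    true_errors = [0.10, 0.20, 0.30, 0.25]
    est_errors = [0.08, 0.05, 0.28, 0.40]
    print(reproducibility_index(true_errors, est_errors, rho=0.05, tau=0.30))
```

In the example, three studies pass the threshold and two of them have a true error within `rho` of the estimate.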

Reproducibility index for m = 5 data sets, LDA, 5F-CV,

5 features, Gaussian with equal covariance matrices,

uncorrelated features

(a) n = 60, $\rho = 0.0005$; (b) n = 60, $\rho = 0.05$;

(c) n = 120, $\rho = 0.0005$; (d) n = 120, $\rho = 0.05$;

Class sizes, n0 and n1, pre-determined

Hence, no estimate for class prior

probability c = P(Y = 0).

Random sampling: $r = n_0/n \to c$ as $n \to \infty$ (in probability)

Dark blue (c =0.3), black (c = 0.4), light

blue (c = 0.5), red (c = 0.6), green (c = 0.7)

r* (crossing point) is minimax value

Top: equal covariance; bottom: unequal

covariance

Esfahani, M. S., and E. R. Dougherty, Effect of

Separate Sampling on Classification Accuracy,

Bioinformatics,

Class sizes, n0 and n1, pre-determined.

Apply classical 5-fold cross-validation on

the data set to estimate the error (dashed

lines).

Apply separate-sampling 5-fold cross-validation (solid lines).

Dark blue (c =0.3), black (c = 0.4), light blue

(c = 0.5), red (c = 0.6), green (c = 0.7)

Top: n = 80; bottom: n = 1000

Braga-Neto, U. L., Zollanvari, U. M., and E. R.

Dougherty, Cross-Validation Under Separate

Sampling: Optimistic Bias and How to Correct It,

[Figure labels: genes; time course or experiments; patterns. Relationship?]

Data are clustered

by some clustering

algorithm.

Is there scientific

knowledge here?

Clustering Algorithm

An algorithm that partitions a set of points into several

groups, based on a measure of similarity (or

dissimilarity) between the points.

Example: [Figure: points plotted on axes x1, x2, x3 and partitioned into Group 1, Group 2, and Group 3]

Cluster expression vectors: clusters indicate

potential co-regulation in time-course data analysis.

Cluster samples: clusters indicate potential similar

sources, a sort of classification.

Methods

Fuzzy c-means

K-means

S.O.M.

Hierarchical clustering (Euclidean distance)

Hierarchical clustering (correlation)

K-means Clustering

Goal: Partition points into tight clusters.

Algorithm:

Randomly initialize with k means $m_1, \dots, m_k$

Place x into $C_i$ if $\|x - m_i\| \le \|x - m_j\|$ for $j = 1, \dots, k$

Update $m_1, \dots, m_k$ as the means of $C_1, \dots, C_k$

Repeat until means do not change

Clusters determined by Voronoi diagram of $m_1, \dots, m_k$
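
The steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: it assumes the initial means are drawn from the data points themselves, stops when the assignments no longer change (which implies the means no longer change), and leaves the mean of an empty cluster where it was. The function and parameter names are hypothetical.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, seed=0, max_iter=100):
    """Return (labels, means) for a basic k-means run."""
    rng = random.Random(seed)
    # Assumption: initialize the means with k distinct data points.
    means = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest mean.
        new_labels = [min(range(k), key=lambda i: sq_dist(p, means[i]))
                      for p in points]
        if new_labels == labels:
            break  # assignments are stable, so the means are stable too
        labels = new_labels
        # Update step: each mean becomes the centroid of its cluster.
        for i in range(k):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:
                means[i] = [sum(c) / len(members) for c in zip(*members)]
    return labels, means

if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    print(kmeans(data, k=2))
```

The result depends on the random initialization, so different seeds can give different partitions on less clearly separated data.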

Hierarchical Clustering

Iteratively join clusters based on similarity measure

(agglomerative clustering).

Farthest neighbor similarity measure:

$d(C_i, C_j) = \max \{\|x - y\| : x \in C_i, y \in C_j\}$

Initialize clusters by $C_i = \{x_i\}$ for $i = 1, \dots, n$

Iteratively merge the clusters for which the greatest distance

between points in the two clusters is minimized

Halts when the similarity measure exceeds a pre-defined

threshold
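
A minimal sketch of this agglomerative procedure with the farthest-neighbor measure is given below. It uses a brute-force search over all cluster pairs, which is adequate only for small point sets; the function name and the `threshold` parameter are illustrative.

```python
import math

def farthest_neighbor(a, b):
    """Largest Euclidean distance between a point of cluster a and one of b."""
    return max(math.dist(x, y) for x in a for y in b)

def complete_linkage(points, threshold):
    """Merge clusters until the smallest farthest-neighbor distance exceeds threshold."""
    clusters = [[p] for p in points]  # start with one cluster per point
    while len(clusters) > 1:
        # Find the pair of clusters whose farthest-neighbor distance is smallest.
        d, i, j = min(
            (farthest_neighbor(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if d > threshold:
            break  # halting rule: the measure exceeds the pre-defined threshold
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

if __name__ == "__main__":
    data = [(0, 0), (0, 1), (5, 5), (5, 6)]
    print(complete_linkage(data, threshold=2.0))
```

Raising the threshold yields fewer, larger clusters; a threshold above the diameter of the data returns a single cluster.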

A. cholesterol biosynthesis

B. cell cycle

C. immediate-early response

D. signaling and angiogenesis

E. wound healing and tissue remodeling

Source: Michael B. Eisen, et

al., PNAS 1998, Vol.95

Jain et al.: Clustering is a subjective process; the

same set of data items often needs to be partitioned

differently for different applications.

Jain, A.K., Murty, M. N., and P.J. Flynn, Data Clustering: A Review,

ACM Computing Surveys, 31 (3), 264-323, 1999.

Solution

Mathematical theory

Pattern recognition theory and random set theory

Example:

- 2 or 3 clusters?

- What is the best separation?

Generate set of points from different distributions: $A_1, \dots, A_k$.

Use clustering algorithm to form clusters: $C_1, \dots, C_k$.

Align point sets and clusters, and count errors.

Average over a number of randomly generated sets.

Dougherty, E. R. , Barrera, J., Brun, M., Kim, S., Cesar, R. M., Chen, Y.,

Gene-Expression Microarrays," Computational Biology 9 (1), 105-126, 2002.
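
The simulation procedure listed above can be sketched as follows. This is an illustration under simplifying assumptions, not the setup of the cited paper: the point sets are one-dimensional Gaussians, the clustering algorithm is a small 1-D k-means stand-in, and alignment is done by trying every relabeling of the clusters, which is practical only for small k. All function names and parameter values are hypothetical.

```python
import random
from itertools import permutations

def clustering_error(true_labels, cluster_labels, k):
    """Fraction of mislabeled points, minimized over all relabelings of the clusters."""
    n = len(true_labels)
    best = n
    for perm in permutations(range(k)):
        errors = sum(1 for t, c in zip(true_labels, cluster_labels) if perm[c] != t)
        best = min(best, errors)
    return best / n

def kmeans_1d(xs, k, rng, iters=50):
    """Stand-in clustering algorithm: basic k-means on scalars."""
    means = rng.sample(xs, k)
    labels = [0] * len(xs)
    for _ in range(iters):
        labels = [min(range(k), key=lambda i: abs(x - means[i])) for x in xs]
        for i in range(k):
            members = [x for x, lab in zip(xs, labels) if lab == i]
            if members:
                means[i] = sum(members) / len(members)
    return labels

def average_error(centers, sd, n_per_class, trials, seed=0):
    """Average clustering error over randomly generated labeled point sets."""
    rng = random.Random(seed)
    k = len(centers)
    total = 0.0
    for _ in range(trials):
        xs, truth = [], []
        for label, center in enumerate(centers):
            for _ in range(n_per_class):
                xs.append(rng.gauss(center, sd))
                truth.append(label)
        total += clustering_error(truth, kmeans_1d(xs, k, rng), k)
    return total / trials

if __name__ == "__main__":
    print(average_error([0.0, 10.0], sd=0.5, n_per_class=20, trials=20))
    print(average_error([0.0, 1.0], sd=1.0, n_per_class=20, trials=20))
```

Because the generating labels are known, the error is measured against ground truth rather than judged by how convincing the clusters look.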

Synthetic Example

5 synthetic

templates

Simulated data

from the templates

different variances

5 different

clustering methods

[Figure: c-means result; no error, attributed to small variance]

Experiment ($\sigma^2 = 3.0$)

many

misclassifications

22 misclassifications

(8.8%)

Before clustering

NICE dendrogram

24.5% Error!!

Clustering Error

Points are a realization S of a labeled random point

process.

A clustering algorithm assigns to S a label function.

The error of the algorithm is the expected difference between its

labels and the labels generated by the point process.

Error must take into account that we do not care about

the ordering, only the partitions generated.

Expectation taken with respect to the distribution of the

point process.

Left: Realization of point process

Right: Output of hierarchical clustering

Error: 40%

Clustering Validity

Clustering validity is analogous to classification

validity.

Replace classifier with cluster operator and

classification error with clustering error.

Validation Indices

Validation indices are meant to judge the validity of a

clustering output.

They can be based on a number of heuristic

considerations and methodologies.

Do they correspond to scientific validity?

Does a validation index correlate to clustering error?

Brun, M., Sima, C., Hua, J., Lowey, J., Carroll, B., Suh, E., and E. R.

Dougherty, Model-Based Evaluation of Clustering Validation Measures,

Pattern Recognition, 40 (3), 807-824, 2007.

Top: Realization of

point process

Bottom: Kendall's

correlation:

Dunn's index, D correl,

silhouette, figure of merit

Top: Realization of point

process

Bottom: Kendall's

correlation:

Dunn's index, D correl,

silhouette, figure of merit

Scientific Knowledge

Requires a mathematical model.

In classification, the model is learned from training data.

Can inferences be made from the model?

The model is composed of a classifier (decision

function) and an error: a data point is observed and

it is assigned to a class.

The model is inferred from data by classification and

error-estimation rules.

Model validity is determined by properties of the

error-estimation rule.

Clustering theory in the context of random sets.

Probabilistic error measure based on points being

clustered correctly.

Bayes clusterer (optimal clustering algorithm).

Learning theory for clustering algorithms.

Recognition, 37 (5), 917-925, 2004.

Data mining violates two basic principles of

experimental design: (1) constrain the variables so

that the experiment is only minimally affected by

external conditions and the results elucidate clear

mathematically describable behavior; and (2) all

modeling is done within a rigorous statistical setting

in which both constraints and the sampling

distribution are clearly expressed.

Absent a sound epistemology there is no ground of

knowledge and therefore no knowledge.

There are thousands of papers in the literature for

which there is no demonstration of any meaning at all.

This has several serious consequences:

Literature is untrustworthy and much of it is useless.

Propagation of meaningless results on meaningless results.

Lack of progress on consequential problems.

Dougherty and Bittner (Epistemology of the Cell):

Does anyone really believe that data mining could

produce the general theory of relativity?

