Classification of Microarray
Gene Expression Data
Geoff McLachlan
Department of Mathematics & Institute for Molecular Bioscience
University of Queensland
“A wide range of supervised and
unsupervised learning methods have been
considered to better organize data, be it to
infer coordinated patterns of gene
expression, to discover molecular
signatures of disease subtypes, or to derive
various predictions.”
• Unsupervised classification
(clustering) of tissues – mixture
model-based approach
Vital Statistics
by C. Tilstone
Nature 424, 610-612, 2003.
“DNA microarrays have given geneticists and molecular biologists access to more data than ever before. But do these researchers have the statistical know-how to cope?”
Branching out: cluster analysis can group samples that show similar patterns of gene expression.
MICROARRAY DATA
REPRESENTED by a p × n matrix
$(x_1, \ldots, x_n)$
$x_j$ contains the gene expressions for the p genes of the jth tissue sample (j = 1, …, n).
p = No. of genes ($10^3$ - $10^4$)
n = No. of tissue samples ($10$ - $10^2$)
$$C(x) = \beta_0 + \beta^T x = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p$$
for the prediction of the group label y of a future entity with feature vector x.
FISHER’S LINEAR DISCRIMINANT FUNCTION
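$$y = \operatorname{sign} C(x)$$
where
$$\beta = S^{-1}(\bar{x}_1 - \bar{x}_2),$$
$$\beta_0 = -\tfrac{1}{2}(\bar{x}_1 + \bar{x}_2)^T S^{-1}(\bar{x}_1 - \bar{x}_2),$$
and $\bar{x}_1$, $\bar{x}_2$, and $S$ are the sample means and pooled sample covariance matrix found from the training data.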
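As a rough illustration of Fisher's rule, the sketch below computes the coefficients from the two sample means and the pooled covariance matrix for a toy problem with two features, so that the 2 × 2 inverse can be written out by hand. The data and all function names are made up for the example and do not come from the slides.

```python
def mean_vec(rows):
    """Component-wise sample mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(r[k] for r in rows) / n for k in range(len(rows[0]))]


def pooled_cov_2d(g1, g2):
    """Pooled sample covariance matrix S for two groups of 2-D vectors."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for rows in (g1, g2):
        m = mean_vec(rows)
        for r in rows:
            d = (r[0] - m[0], r[1] - m[1])
            for a in range(2):
                for b in range(2):
                    s[a][b] += d[a] * d[b]
    dof = len(g1) + len(g2) - 2
    return [[s[a][b] / dof for b in range(2)] for a in range(2)]


def fisher_ldf(g1, g2):
    """Return (beta0, beta) of the linear discriminant C(x) = beta0 + beta'x."""
    m1, m2 = mean_vec(g1), mean_vec(g2)
    S = pooled_cov_2d(g1, g2)
    det = S[0][0] * S[1][1] - S[0][1] * S[1][0]
    S_inv = [[S[1][1] / det, -S[0][1] / det],
             [-S[1][0] / det, S[0][0] / det]]
    diff = (m1[0] - m2[0], m1[1] - m2[1])
    # beta = S^{-1} (mean1 - mean2)
    beta = [S_inv[a][0] * diff[0] + S_inv[a][1] * diff[1] for a in range(2)]
    # beta0 = -(1/2) (mean1 + mean2)' S^{-1} (mean1 - mean2)
    mid = ((m1[0] + m2[0]) / 2, (m1[1] + m2[1]) / 2)
    beta0 = -(beta[0] * mid[0] + beta[1] * mid[1])
    return beta0, beta


def classify(x, beta0, beta):
    """Group label y = sign C(x): +1 for group 1, -1 for group 2."""
    c = beta0 + beta[0] * x[0] + beta[1] * x[1]
    return 1 if c > 0 else -1


# Toy training data (illustrative only).
group1 = [(2.0, 2.1), (2.5, 1.9), (3.0, 2.6), (2.2, 2.4)]
group2 = [(0.0, 0.3), (0.5, -0.1), (-0.2, 0.4), (0.3, 0.0)]
beta0, beta = fisher_ldf(group1, group2)
print(classify((2.4, 2.2), beta0, beta), classify((0.1, 0.2), beta0, beta))
```

When p is far larger than n, as with microarray data, the pooled matrix S is singular and cannot be inverted directly, so this sketch only applies after the number of features has been reduced.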
SUPPORT VECTOR CLASSIFIER
Vapnik (1995)
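$$C(x) = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p$$
where $\beta_0$ and $\beta$ are obtained as follows:
$$\min_{\beta,\,\beta_0}\ \tfrac{1}{2}\|\beta\|^2 + \gamma \sum_{j=1}^{n} \xi_j$$
subject to $\xi_j \ge 0$, $y_j C(x_j) \ge 1 - \xi_j$ $(j = 1, \ldots, n)$.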
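To make the constraints concrete, the following sketch evaluates the slack of each training point and the resulting objective for a given coefficient vector. It does not solve the optimisation problem. The name `gamma` for the weight on the total slack, the toy data, and the function names are assumptions made for the example.

```python
def decision(x, beta0, beta):
    """Linear rule C(x) = beta0 + beta_1 x_1 + ... + beta_p x_p."""
    return beta0 + sum(b * xi for b, xi in zip(beta, x))


def slacks(X, y, beta0, beta):
    """Smallest slack for each point: xi_j = max(0, 1 - y_j C(x_j))."""
    return [max(0.0, 1.0 - yj * decision(xj, beta0, beta))
            for xj, yj in zip(X, y)]


def objective(X, y, beta0, beta, gamma=1.0):
    """Half the squared norm of beta plus gamma times the total slack."""
    return 0.5 * sum(b * b for b in beta) + gamma * sum(slacks(X, y, beta0, beta))


# Toy data with labels +1 / -1 (illustrative only).
X = [(2.0, 0.0), (0.5, 0.0), (-2.0, 0.0)]
y = [1, 1, -1]
print(slacks(X, y, 0.0, (1.0, 0.0)))     # second point lies inside the margin
print(objective(X, y, 0.0, (1.0, 0.0)))
```

A point with zero slack satisfies the margin constraint exactly or with room to spare; a positive slack measures how far it falls short.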
REPLACE x by h(x)
$$C(x) = \sum_{j=1}^{n} \hat{\alpha}_j \langle h(x_j), h(x) \rangle + \hat{\beta}_0 = \sum_{j=1}^{n} \hat{\alpha}_j K(x_j, x) + \hat{\beta}_0$$
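The kernel form of the rule can be sketched as follows: the classifier is a weighted sum of kernel evaluations between the new point and the training points, plus an intercept. The Gaussian kernel, its `width` parameter, the coefficient values, and the function names are assumptions chosen for illustration; in practice the coefficients come out of the fitted support vector classifier.

```python
import math


def linear_kernel(u, v):
    """K(u, v) = <u, v>, which recovers the linear rule."""
    return sum(a * b for a, b in zip(u, v))


def rbf_kernel(u, v, width=1.0):
    """Gaussian kernel exp(-||u - v||^2 / (2 width^2))."""
    sq = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-sq / (2.0 * width ** 2))


def kernel_decision(x, points, alpha, beta0, kernel=rbf_kernel):
    """C(x) = sum_j alpha_j K(x_j, x) + beta0."""
    return sum(a * kernel(xj, x) for a, xj in zip(alpha, points)) + beta0


# Hypothetical fitted coefficients for two training points.
points = [(0.0, 0.0), (2.0, 2.0)]
alpha = [1.0, -1.0]
print(kernel_decision((0.0, 0.0), points, alpha, 0.0))
print(kernel_decision((2.0, 2.0), points, alpha, 0.0))
```

Only the kernel values are needed, so the mapping h never has to be computed explicitly.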
http://www.pnas.org/cgi/content/full/99/10/6562
GUYON, WESTON, BARNHILL & VAPNIK
(2002, Machine Learning)
LEUKAEMIA DATA:
Only 2 genes are needed to obtain a zero
CVE (cross-validated error rate)
COLON DATA:
Using only 4 genes, CVE is 2%
GUYON et al. (2002)
And so on up to xn.
Figure 1: Error rates of the SVM rule with RFE procedure
averaged over 50 random splits of colon tissue samples
ADDITIONAL REFERENCES
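$$B1 = \frac{1}{n}\sum_{j=1}^{n} \frac{\sum_{k=1}^{K} I_{jk}\, Q_{jk}}{\sum_{k=1}^{K} I_{jk}}$$
with $I_{jk} = 1$ if $x_j \notin$ kth bootstrap sample, and 0 otherwise,
and $Q_{jk} = 1$ if $R_k^*$ misallocates $x_j$, and 0 otherwise.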
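A minimal sketch of a leave-one-out bootstrap error of this kind: for each bootstrap sample a rule is trained, and each observation is scored only by the rules whose bootstrap sample did not contain it. The nearest-mean rule, the number of replications `K`, the seed, and the toy data are assumptions for the example; averaging only over observations that were left out at least once is also a simplification made here.

```python
import random


def nearest_mean_rule(train):
    """Fit a nearest-class-mean rule on (x, label) pairs with scalar x."""
    sums, counts = {}, {}
    for x, lab in train:
        sums[lab] = sums.get(lab, 0.0) + x
        counts[lab] = counts.get(lab, 0) + 1
    means = {lab: sums[lab] / counts[lab] for lab in sums}
    return lambda x: min(means, key=lambda lab: abs(x - means[lab]))


def bootstrap_b1(data, K=200, seed=0):
    """Average, over observations, of the error rate of the rules
    trained on bootstrap samples that did not contain the observation."""
    rng = random.Random(seed)
    n = len(data)
    errors = [0] * n     # sum over k of I_jk * Q_jk
    left_out = [0] * n   # sum over k of I_jk
    for _ in range(K):
        idx = [rng.randrange(n) for _ in range(n)]
        chosen = set(idx)
        rule = nearest_mean_rule([data[i] for i in idx])
        for j in range(n):
            if j not in chosen:                      # I_jk = 1
                left_out[j] += 1
                errors[j] += rule(data[j][0]) != data[j][1]   # Q_jk
    rates = [e / m for e, m in zip(errors, left_out) if m > 0]
    return sum(rates) / len(rates)


# Two well-separated toy groups (illustrative only).
data = [(0.0, 0), (0.2, 0), (0.4, 0), (0.1, 0), (0.3, 0),
        (5.0, 1), (5.2, 1), (4.9, 1), (5.1, 1), (5.3, 1)]
b1 = bootstrap_b1(data)
print(b1)
```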
Toussaint & Sharpe (1975) proposed the
ERROR RATE ESTIMATOR
$$A(w) = (1 - w)\,AE + w\,CV2E$$
where $w = 0.5$
McLachlan (1977) proposed w=wo where wo is
chosen to minimize asymptotic bias of A(w) in the
case of two homoscedastic normal groups.
where
$$w = \frac{.632}{1 - .368\,r},$$
$$r = \frac{B1 - AE}{\gamma - AE} \quad \text{(relative overfitting rate)},$$
$$\gamma = \sum_{i=1}^{g} p_i (1 - q_i) \quad \text{(estimate of no information error rate)}.$$
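The weight can be worked through numerically as below. The sketch assumes that AE is the apparent error, B1 a bootstrap error, p_i the proportion of observations in group i, and q_i the proportion allocated to group i, and that the final estimate combines AE and B1 as (1 - w) AE + w B1 by analogy with A(w); these readings and the function names are assumptions, and the input values are made up.

```python
def no_information_rate(p, q):
    """gamma = sum_i p_i (1 - q_i)."""
    return sum(pi * (1.0 - qi) for pi, qi in zip(p, q))


def overfitting_weight(ae, b1, gamma):
    """Return (w, r) with r = (B1 - AE) / (gamma - AE) and
    w = .632 / (1 - .368 r)."""
    r = (b1 - ae) / (gamma - ae)
    w = 0.632 / (1.0 - 0.368 * r)
    return w, r


def combined_estimate(ae, b1, w):
    """Weighted combination (1 - w) AE + w B1 (assumed form)."""
    return (1.0 - w) * ae + w * b1


# Made-up inputs: two equally likely groups, no apparent error.
gamma = no_information_rate([0.5, 0.5], [0.5, 0.5])
w, r = overfitting_weight(ae=0.0, b1=0.2, gamma=gamma)
print(gamma, r, w, combined_estimate(0.0, 0.2, w))
```

With no overfitting (r = 0) the weight is .632, and with r = 1 the weight is 1, so the estimate moves towards B1 as the relative overfitting rate grows.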
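$$f(x) = \pi_1 \phi(x; \mu_1, \Sigma_1) + \cdots + \pi_g \phi(x; \mu_g, \Sigma_g)$$
where
$$-2 \log \phi(x; \mu, \Sigma) = (x - \mu)^T \Sigma^{-1} (x - \mu) + \text{constant}$$
MAHALANOBIS DISTANCE
$$(x - \mu)^T \Sigma^{-1} (x - \mu)$$
EUCLIDEAN DISTANCE
$$(x - \mu)^T (x - \mu)$$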
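The difference between the two distances can be seen in a small two-dimensional example: the Mahalanobis form rescales each direction by the covariance matrix, while the Euclidean form treats all directions alike. The function names and numbers below are illustrative assumptions.

```python
def euclidean_sq(x, mu):
    """Squared Euclidean distance (x - mu)'(x - mu)."""
    return sum((a - b) ** 2 for a, b in zip(x, mu))


def mahalanobis_sq_2d(x, mu, cov):
    """Squared Mahalanobis distance (x - mu)' cov^{-1} (x - mu) in 2-D."""
    d0, d1 = x[0] - mu[0], x[1] - mu[1]
    det = cov[0][0] * cov[1][1] - cov[0][1] * cov[1][0]
    i00, i01 = cov[1][1] / det, -cov[0][1] / det
    i10, i11 = -cov[1][0] / det, cov[0][0] / det
    return d0 * (i00 * d0 + i01 * d1) + d1 * (i10 * d0 + i11 * d1)


mu = (0.0, 0.0)
cov = [[4.0, 0.0], [0.0, 1.0]]   # first coordinate has the larger variance
for x in [(2.0, 0.0), (0.0, 2.0)]:
    print(x, euclidean_sq(x, mu), mahalanobis_sq_2d(x, mu, cov))
```

Both points are equally far from the mean in Euclidean terms, but the point lying along the high-variance direction is closer in Mahalanobis terms. With an identity covariance matrix the two distances coincide.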
MIXTURE OF g NORMAL COMPONENTS
$$f(x) = \pi_1 \phi(x; \mu_1, \Sigma_1) + \cdots + \pi_g \phi(x; \mu_g, \Sigma_g)$$
k-means
$$\Sigma_1 = \cdots = \Sigma_g = \sigma^2 I$$
SPHERICAL CLUSTERS
Equal spherical covariance matrices
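For comparison with the model-based view, here is a minimal sketch of the k-means iteration, which assigns each point to the nearest centre in Euclidean distance and then recomputes each centre as the mean of its points. The starting centres, the fixed number of iterations, the toy data, and the function names are assumptions for the example.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def nearest(point, centres):
    """Index of the centre closest to the point."""
    return min(range(len(centres)), key=lambda i: sq_dist(point, centres[i]))


def kmeans(points, centres, n_iter=20):
    """Alternate assignment and mean-update steps for a fixed number of passes."""
    centres = [tuple(c) for c in centres]
    for _ in range(n_iter):
        members = [[] for _ in centres]
        for p in points:
            members[nearest(p, centres)].append(p)
        # An empty cluster keeps its previous centre.
        centres = [
            tuple(sum(col) / len(grp) for col in zip(*grp)) if grp else old
            for grp, old in zip(members, centres)
        ]
    labels = [nearest(p, centres) for p in points]
    return centres, labels


points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
centres, labels = kmeans(points, centres=[(0, 0), (5, 5)])
print(centres, labels)
```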
Crab Data
$$f(x) = \pi_1 \phi(x; \mu_1, \Sigma_1) + \cdots + \pi_i \phi(x; \mu_i, \Sigma_i) + \cdots + \pi_g \phi(x; \mu_g, \Sigma_g)$$
http://www.maths.uq.edu.au/~gjm
http://www.maths.uq.edu.au/~gjm/EMMIX_Demo/emmix.html
PROVIDES A MODEL-BASED
APPROACH TO CLUSTERING
http://www.bioinformatics.oupjournals.org/cgi/screenpdf/18/3/413.pdf
Example: Microarray Data
Colon Data of Alon et al. (1999)
n = 62 (40 tumours; 22 normals) tissue samples of p = 2,000 genes in a 2,000 × 62 matrix.
Mixture of 2 normal components
Mixture of 2 t components
Mixture of 2 t components
Mixture of 3 t components
In this process, the genes are being treated
anonymously.
Clustering of COLON Data
Tissues using EMMIX-GENE
Grouping for Colon Data
[Figure: 20 panels, numbered 1 to 20]
Mixtures of Factor Analyzers
A normal mixture model without restrictions
on the component-covariance matrices may
be viewed as too general for many situations
in practice, in particular with
high-dimensional data.
where
$$\Sigma_i = B_i B_i^T + D_i \quad (i = 1, \ldots, g),$$
$B_i$ is a p × q matrix and $D_i$ is a diagonal matrix.
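The saving from the factor-analytic form can be sketched by counting covariance parameters per component and by building a covariance matrix from a small loading matrix. The count for the factor form subtracts q(q - 1)/2 for the rotational indeterminacy of the loadings, a common convention that is an assumption here, as are the value q = 4, the toy matrices, and the function names.

```python
def n_cov_params_unrestricted(p):
    """Distinct elements of an unrestricted p x p covariance matrix."""
    return p * (p + 1) // 2


def n_cov_params_factor(p, q):
    """Loadings (p*q) plus diagonal terms (p), less q(q-1)/2 for rotation."""
    return p * q + p - q * (q - 1) // 2


def factor_covariance(B, d):
    """Sigma = B B' + D, with B given as p rows of length q and
    d the diagonal of D."""
    p, q = len(B), len(B[0])
    sigma = [[sum(B[a][k] * B[b][k] for k in range(q)) for b in range(p)]
             for a in range(p)]
    for a in range(p):
        sigma[a][a] += d[a]
    return sigma


print(n_cov_params_unrestricted(2000))   # one component, p = 2,000 genes
print(n_cov_params_factor(2000, 4))      # same p with q = 4 factors
print(factor_covariance([[1, 0], [0, 1], [1, 1]], [0.5, 0.5, 0.5]))
```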
Number of Components
in a Mixture Model
Testing for the number of components,
g, in a mixture is an important but very
difficult problem which has not been
completely resolved.
Order of a Mixture Model
A mixture density with g components might
be empirically indistinguishable from one
with either fewer than g components or
more than g components. It is therefore
sensible in practice to approach the question
of the number of components in a mixture
model in terms of an assessment of the
smallest number of components in the
mixture compatible with the data.
Likelihood Ratio Test Statistic
An obvious way of approaching the
problem of testing for the smallest value of
the number of components in a mixture
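model is to use the LRTS, $-2\log\lambda$. Suppose
we wish to test the null hypothesis,
$$H_0: g = g_0 \quad \text{versus} \quad H_1: g = g_1$$
for some $g_1 > g_0$.
We let $\hat{\Psi}_i$ denote the MLE of $\Psi$ calculated
under $H_i$ (i = 0, 1). Then the evidence against
$H_0$ will be strong if $\lambda$ is sufficiently small, or
equivalently, if $-2\log\lambda$ is sufficiently large,
where
$$-2\log\lambda = 2\{\log L(\hat{\Psi}_1) - \log L(\hat{\Psi}_0)\}.$$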
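A small worked sketch of the statistic for g0 = 1 versus g1 = 2 with univariate normal components: the log likelihood under one component is available in closed form, and under two components it is maximised here by a basic EM iteration. The starting values, the variance floor, the iteration count, the toy data, and the function names are all assumptions made for the example.

```python
import math


def normal_pdf(x, mu, var):
    """Univariate normal density."""
    return math.exp(-(x - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def loglik_one_normal(xs):
    """Maximised log likelihood for a single normal component."""
    n = len(xs)
    mu = sum(xs) / n
    var = sum((x - mu) ** 2 for x in xs) / n
    return sum(math.log(normal_pdf(x, mu, var)) for x in xs)


def loglik_two_normals(xs, n_iter=200, var_floor=1e-6):
    """Log likelihood of a two-component normal mixture fitted by EM."""
    n = len(xs)
    m = sum(xs) / n
    v0 = sum((x - m) ** 2 for x in xs) / n
    pi, mu, var = [0.5, 0.5], [min(xs), max(xs)], [v0, v0]
    for _ in range(n_iter):
        # E-step: posterior probabilities of component membership.
        resp = []
        for x in xs:
            w = [pi[i] * normal_pdf(x, mu[i], var[i]) for i in range(2)]
            total = w[0] + w[1]
            resp.append([w[0] / total, w[1] / total])
        # M-step: weighted proportions, means and variances.
        for i in range(2):
            ni = sum(r[i] for r in resp)
            pi[i] = ni / n
            mu[i] = sum(r[i] * x for r, x in zip(resp, xs)) / ni
            var[i] = max(sum(r[i] * (x - mu[i]) ** 2 for r, x in zip(resp, xs)) / ni,
                         var_floor)
    return sum(math.log(sum(pi[i] * normal_pdf(x, mu[i], var[i]) for i in range(2)))
               for x in xs)


def lrts(loglik0, loglik1):
    """-2 log lambda = 2 (log L1 - log L0)."""
    return 2.0 * (loglik1 - loglik0)


xs = [0.1, -0.2, 0.3, 0.0, -0.1, 5.1, 4.8, 5.3, 5.0, 4.9]
stat = lrts(loglik_one_normal(xs), loglik_two_normals(xs))
print(stat)
```

A large value of the statistic favours the larger number of components; how large it must be is the difficult part of the problem noted on the slides.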
Breast cancer data set in van’t Veer et al.
(van’t Veer et al., 2002, Gene Expression Profiling Predicts
Clinical Outcome Of Breast Cancer, Nature 415)
These data were the result of microarray experiments
on three patient groups with different classes of
breast cancer tumours.
 i   mi      Ui |  i   mi      Ui |  i   mi      Ui |  i   mi      Ui
 1  146  112.98 | 11   66   25.72 | 21   44   13.77 | 31   53    9.84
 2   93   74.95 | 12   38   25.45 | 22   30   13.28 | 32   36    8.95
 3   61   46.08 | 13   28   25.00 | 23   25   13.10 | 33   36    8.89
 4   55   35.20 | 14   53   21.33 | 24   67   13.01 | 34   38    8.86
 5   43   30.40 | 15   47   18.14 | 25   12   12.04 | 35   44    8.02
 6   92   29.29 | 16   23   18.00 | 26   58   12.03 | 36   56    7.43
 7   71   28.77 | 17   27   17.62 | 27   27   11.74 | 37   46    7.21
 8   20   28.76 | 18   45   17.51 | 28   64   11.61 | 38   19    6.14
 9   23   28.44 | 19   80   17.28 | 29   38   11.38 | 39   29    4.64
10   23   27.73 | 20   55   13.79 | 30   21   10.72 | 40   35    2.44
where i = group number
mi = number in group i
Ui = -2 log λi
Heat Map of Genes in Group G1
Heat Map of Genes in Group G2
Heat Map of Genes in Group G3
1. A change in gene expression is apparent between
the sporadic (first 78 tissue samples) and hereditary
(last 20 tissue samples) tumours.
2. The final two tissue samples (the two BRCA2
tumours) show consistent patterns of expression.
This expression is different from that exhibited by
the set of BRCA1 tumours.
3. The problem of trying to distinguish between the
two classes, patients who were disease-free after 5
years (class 1) and those with metastases within 5 years
(class 2), is not straightforward on the basis of the gene
expressions.
Selection of Relevant Genes
We compared the genes selected by EMMIX-
GENE with those genes retained in the original
study by van’t Veer et al. (2002).
van’t Veer et al. used an agglomerative
hierarchical algorithm to organise the genes into
dominant gene groups. Two of these groups were
highlighted in their paper, with their genes
corresponding to biologically significant features.
Cluster A
  Identification of van’t Veer et al.: containing genes co-regulated with the ER-α gene (ESR1)
  Number of genes: 40
  Number of matches with genes retained by select-gene: 24
Cluster B
  Identification of van’t Veer et al.: containing “co-regulated genes that are the molecular reflection of extensive lymphocytic infiltrate, and comprise a set of genes expressed in T and B cells”
  Number of genes: 40
  Number of matches with genes retained by select-gene: 23