
1246 IEEE TRANSACTIONS ON BIOMEDICAL ENGINEERING, VOL. 58, NO. 5, MAY 2011
Incremental Fuzzy Mining of Gene Expression
Data for Gene Function Prediction
Patrick C. H. Ma* and Keith C. C. Chan
Abstract: Due to the complexity of the underlying biological
processes, gene expression data obtained from DNA microarray
technologies are typically noisy and have very high dimensionality,
and these properties make the mining of such data for gene function
prediction very difficult. To tackle these difficulties, we propose an
incremental fuzzy mining technique called incremental fuzzy mining
(IFM). By transforming quantitative expression values into linguistic
terms, such as "highly expressed" or "lowly expressed", IFM can effectively
capture heterogeneity in expression data for pattern discovery. It
does so using a fuzzy measure to determine if interesting association
patterns exist between the linguistic gene expression levels.
Based on these patterns, IFM can make accurate gene function
predictions, and these predictions can be made in such a way that each
gene is allowed to belong to more than one functional class
with different degrees of membership. The gene function prediction
problem can be formulated as either a classification or a clustering
problem, and IFM can be used either as a classification technique
or together with existing clustering algorithms to improve the cluster
groupings discovered for greater prediction accuracy. IFM is
also characterized by being an incremental data mining technique,
so that the discovered patterns can be continually refined
based only on newly collected data, without the need for retraining
on the whole dataset. For performance evaluation, IFM has
been tested with real expression datasets for both classification and
clustering tasks. Experimental results show that it can effectively
uncover hidden patterns for accurate gene function predictions.
Index Terms: Bioinformatics, fuzzy data mining, gene expression data analysis, gene function prediction, pattern discovery.
I. INTRODUCTION
ADVANCEMENT in DNA microarray technologies has
made it possible to simultaneously monitor the expression levels
of thousands of genes under different experimental conditions
[1]-[3]. The gene expression data obtained through
such technologies can be useful for many applications in bioinformatics,
if properly analyzed. For instance, they can be used
to facilitate gene function prediction [4].
The gene function prediction problem can be formulated as a
clustering problem so that, given a database of gene expression
data, a clustering algorithm can be used to group genes that have
similar expression profiles into clusters [4]-[6]. Since genes that
Manuscript received December 19, 2009; revised February 9, 2010; accepted
March 21, 2010. Date of publication April 15, 2010; date of current version
April 20, 2011. This work was supported in part by The Hong Kong Polytechnic
University. Asterisk indicates corresponding author.
*P. C. H. Ma is with the Department of Computing, The Hong Kong Polytechnic
University, Hong Kong, China (e-mail: cschma@comp.polyu.edu.hk).
K. C. C. Chan is with the Department of Computing, The Hong Kong Polytechnic
University, Hong Kong, China (e-mail: cskcchan@comp.polyu.edu.hk).
Digital Object Identifier 10.1109/TBME.2010.2047724
perform the same biological functions are expected to exhibit
similar expression patterns across different experimental conditions,
the genes belonging to the same cluster are expected to
perform the same functions. Clustering gene expression data
can, therefore, be useful in identifying different gene function
groups, and a gene that is grouped with another gene in the same
group can be expected to perform the same functions. For the
purpose of clustering gene expression data, clustering techniques
[7]-[14] such as the hierarchical agglomerative clustering
algorithm [7], the k-means algorithm [8], the self-organizing map
(SOM) [9], and the fuzzy c-means algorithm [10] have been
commonly used.
Other than formulating the gene function prediction problem
as a clustering problem and tackling it using clustering
algorithms, it should be noted that the problem can also be
formulated as a classification problem and tackled using
classification techniques. Given the expression profiles of several
classes of genes that perform known biological functions,
classification techniques can be used to discover the characteristics
of each gene class, so that a gene whose class membership is
not known beforehand can be classified based on the characteristics
discovered. The function that the gene performs can then be
taken to be the same as that of the genes belonging to the class
it is classified into. Classification techniques that have been used
for gene function prediction include those based on the
k-nearest-neighbor (k-NN) rule [15] and the support vector machine
(SVM) [16].
Since biological processes are naturally complex, irregular
expression patterns can always exist among genes that belong
even to the same functional classes [17]. Also, since gene expression
data are noisy and have very high dimensionality [18]-[20],
gene function prediction, whether formulated as a clustering
or a classification problem, is difficult, and traditional clustering
and classification techniques, which were not originally
developed to deal with gene expression data, may not always be the
most suitable. For example, it may not be easy for these
algorithms to discover that genes in a particular functional class
are usually highly expressed under some experimental conditions,
whereas they are likely to be lowly expressed under
some others. The use of distance or correlation measures by
these algorithms may not be able to uncover such patterns, and
this is especially difficult when the data being dealt with are very
noisy. To discover such patterns, the quantitative gene expression
data should best be discretized into intervals representing
"highly expressed", "lowly expressed", etc. To do so, it should
be noted that a typical discretization process divides quantitative data
into nonoverlapping crisp intervals [5], and such an approach
has the disadvantage that it does not handle values at interval
boundaries very well. A slight change in interval boundaries
may lead to very different interpretations of gene expression
values and may introduce more noise into the data, making it
difficult for important patterns to be discovered.

0018-9294/$26.00 © 2011 IEEE
In light of the prowess of fuzzy logic [21]-[23] in dealing with
the uncertainties arising from noisy and inexact data, which are
quite commonplace in expression data, and also to make the
patterns discovered easily interpretable by human users, some
fuzzy logic-based data mining approaches for gene expression
data analysis have recently been proposed [24]-[26]. In [24]
and [25], fuzzy approaches were applied to search gene expression
data for regulatory triplets consisting of activator,
repressor, and target genes. Gene expression levels are first
converted into three different states (low, medium, and high) to
varying degrees based on a set of predefined membership functions.
Genes are then paired as activator-repressor pairs to determine
the expression value of the target gene based on a set of
fuzzy rules. These regulatory triplets are then ranked based on
a residual score between predicted and actual expression values
and a variance score of the activator-repressor gene pair. The
triplets with low residual scores and low variance scores are then
the most likely to exhibit regulatory relationships. These approaches
are not applicable to gene function prediction, as they
were specifically developed for solving gene regulatory network
reconstruction problems. In addition, the similarity or distance
measures that existing fuzzy logic-based approaches [24]-[26]
use do not tell us what expression levels under what experimental
conditions are important in characterizing the genes in
a functional class. As the discovery of such patterns is important
to gene function analysis, we propose a technique
called incremental fuzzy mining (IFM) for gene function
prediction.
The rest of this paper is organized as follows. In Section II,
the proposed technique is described. The effectiveness of this
technique has been evaluated through experiments with real
gene expression datasets. The experimental setup, together with
the results, is discussed in Section III. In Section IV, we give a
summary of the paper.
II. PROPOSED FUZZY DATA MINING TECHNIQUE
By transforming quantitative expression values into linguistic
terms, such as "highly expressed" or "lowly expressed", IFM can effectively
capture heterogeneity in expression data for pattern discovery.
It does so using a fuzzy measure to determine if interesting
association patterns exist between the linguistic gene expression
levels. Based on these patterns, IFM can make accurate gene
function predictions, and these predictions can be made in such
a way that each gene is allowed to belong to more than
one functional class with different degrees of membership. In
summary, IFM consists of the following four steps.
1) Define a set of linguistic terms that gene expression data
can be transformed into using predefined fuzzy membership
functions.
2) Discover fuzzy association patterns between the different
linguistic expression levels using our proposed fuzzy
measure.
3) Assign a weight to each discovered association pattern;
the weight represents how important a particular linguistic
expression level of a gene in a particular class is for the
characterization of that class.
4) Predict the functional classes of genes whose class
memberships are unknown based on the weighted association
patterns.
A. Step 1: Fuzzification
IFM works with fuzzy linguistic variables. To define these
variables, let us assume that we are given a set of gene expression
data G consisting of the data collected from N genes
in M experiments carried out under different sets of experimental
conditions. Let us represent the dataset as a set of
N genes, $G = \{G_1, \ldots, G_i, \ldots, G_N\}$, with each gene $G_i$, $i = 1, \ldots, N$,
characterized by M experimental conditions, $E = \{E_1, \ldots, E_j, \ldots, E_M\}$,
whose values $e_{i1}, \ldots, e_{ij}, \ldots, e_{iM}$, where $e_{ij} \in \mathrm{dom}(E_j)$,
represent the expression value of the ith gene under the jth experimental
condition. Each gene is also preclassified into one of the known
functional classes. If the class information of the dataset is not
available, then it is a clustering problem that is being dealt with. In
that case, a two-phase clustering approach can be used: in the first
phase, any popular clustering algorithm is used to group genes into
a set of initial clusters, and in the second phase, IFM is applied to
the clusters discovered. Each member of each cluster is then
considered in turn to see if it should remain in its cluster or be
assigned to a different one.
In order to minimize the impact of noisy data in the data
mining process, we propose to represent these quantitative gene
expression data with linguistic variables and terms using the
concepts of fuzzy sets. To do so, we let $L = \{L_1, \ldots, L_j, \ldots, L_M\}$
be a set of linguistic variables such that $L_j \in L$ corresponds to
$E_j \in E$. For each quantitative attribute $E_j$, we denote the domain
of the attribute as $\mathrm{dom}(E_j) = [l_j, u_j]$, where $l_j$ and $u_j$
represent the lower and upper bounds, respectively. In other
words, the linguistic variable $L_j$, which represents $E_j$, takes
on the linguistic terms defined on $\mathrm{dom}(E_j)$. The set of these
linguistic terms is denoted as $T(L_j) = \{l_{jk} \mid k = 1, \ldots, s_j\}$,
where $l_{jk}$ is a linguistic term characterized by a fuzzy set $F_{jk}$
with membership function $\mu_{F_{jk}}$ defined on $\mathrm{dom}(E_j)$ so that
$\mu_{F_{jk}}: \mathrm{dom}(E_j) \to [0, 1]$. Given the aforementioned notation,
we represent the value of the linguistic variable $L_j$ in $G_i$ as $l_{jk}$
and the corresponding degree of membership as $\mu_{F_{jk}}(e_{ij})$.
Since gene expression can be described in a finite number
of different states, such as "highly expressed" and "highly
repressed", "upregulated" and "downregulated", "expressed" and
"not expressed", or other numbers of states [4], for
our application here we adopt the three states proposed
in [27], as this approach has been successfully applied to gene
expression data with promising results. The three states are
"highly expressed" (H), "averagely expressed" (A), and "lowly
expressed" (L), and they are defined by the three fuzzy sets
shown in Fig. 1.
Fig. 1. Membership functions.
Fig. 2. Degree of membership.

For each quantitative attribute $E_j$, where $j = 1, \ldots, M$, let
$E_j^{\max}$ and $E_j^{\min}$ denote the maximum and minimum values
that $E_j$ can take on. Let the values of $E_j$ be sorted in ascending
order, and let $P_{j1}$ be the value of $E_j$ that exceeds one-third
of the measurements and is less than the remaining two-thirds,
and $P_{j2}$ be the value of $E_j$ that exceeds two-thirds of the
measurements and is less than the remaining one-third. In order to
determine $P_{j1}$ and $P_{j2}$, the measurements are divided into a
number $n_c$ of small class intervals of equal width ($n_c = 10$, as
suggested in [27]), and the corresponding class frequencies $f_i$,
where $i = 1, 2, \ldots, n_c$, are counted. The position of the kth
partition value, where $k = 1, 2$ for three partitions, is calculated
as follows [27]:

$$P_{jk} = \mathrm{low}_i + \frac{R_k - cf_{i-1}}{f_i} \qquad (1)$$

where $\mathrm{low}_i$ is the lower limit of the ith class interval,
$R_k = (N \cdot k)/N_F$ is the rank of the kth partition value, $N_F$ is the
total number of fuzzy sets, $N$ is the total number of measurements,
and $cf_{i-1}$ is the cumulative frequency of the immediately
preceding class interval such that $cf_{i-1} < R_k < cf_i$.
We define $A_{j1} = (E_j^{\min} + P_{j1})/2$, $A_{j2} = (P_{j1} + P_{j2})/2$, and
$A_{j3} = (P_{j2} + E_j^{\max})/2$. Given the aforementioned representation,
the degree of membership of a gene expression value $e_{ij}$ of
$E_j$ in $G_i$ to each fuzzy set can be computed as shown in Fig. 2.
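To make Step 1 concrete, the partition-value computation of (1) and the resulting fuzzification can be sketched as below. Since Figs. 1 and 2 are not reproduced here, the exact membership shapes are an assumption: triangular functions peaking at $A_{j1}$, $A_{j2}$, $A_{j3}$ with flat shoulders at the extremes, and the class-interval width is included in (1), as in the usual grouped-data percentile formula.

```python
import numpy as np

def partition_values(x, n_c=10, n_f=3):
    """Estimate the partition values P_j1, P_j2 of one attribute (eq. 1).

    x: expression values of attribute E_j across all N measurements.
    n_c: number of equal-width class intervals (10, as suggested in [27]).
    n_f: total number of fuzzy sets (3 here).
    """
    x = np.asarray(x, dtype=float)
    n = len(x)
    freqs, edges = np.histogram(x, bins=n_c)           # class frequencies f_i
    cum = np.concatenate(([0], np.cumsum(freqs)))      # cumulative frequencies cf_i
    widths = np.diff(edges)
    ps = []
    for k in range(1, n_f):                            # k = 1, 2
        r_k = n * k / n_f                              # rank R_k of the kth partition
        i = int(np.searchsorted(cum, r_k, side="left")) - 1  # cf_{i-1} < R_k <= cf_i
        i = max(0, min(i, n_c - 1))
        f_i = max(freqs[i], 1)
        # eq. (1), with the class width included (an assumption; the
        # extracted layout of the formula is ambiguous on this point)
        ps.append(edges[i] + (r_k - cum[i]) / f_i * widths[i])
    return ps

def memberships(e, e_min, p1, p2, e_max):
    """Degrees of membership of value e to the fuzzy sets L, A, H.

    Triangular membership functions centered at A_j1, A_j2, A_j3 as
    defined in the text; the triangular shape itself is an assumption,
    since Fig. 2 is not reproduced here.
    """
    a1 = (e_min + p1) / 2.0
    a2 = (p1 + p2) / 2.0
    a3 = (p2 + e_max) / 2.0

    def tri(v, left, center, right):
        if v <= left or v >= right:
            return 0.0
        if v <= center:
            return (v - left) / (center - left) if center > left else 1.0
        return (right - v) / (right - center) if right > center else 1.0

    low = 1.0 if e <= a1 else tri(e, a1 - (a2 - a1), a1, a2)   # left shoulder
    high = 1.0 if e >= a3 else tri(e, a2, a3, a3 + (a3 - a2))  # right shoulder
    avg = tri(e, a1, a2, a3)
    return {"L": low, "A": avg, "H": high}
```

A value falling exactly at a crisp interval boundary thus receives partial membership in two adjacent states instead of an all-or-nothing label, which is the motivation given in Section I.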
B. Step 2: Fuzzy Association Pattern Discovery
For pattern discovery, let $l_{pq} \to l_{jk}$ be the fuzzy association
pattern between the linguistic terms $l_{pq}$ and $l_{jk}$ in a particular
gene functional class, where $l_{pq}$ is the value of the linguistic
variable $L_p$ in $G_i$ and $q = 1, \ldots, s_p$. Then, the observed total
degree $o(l_{pq} \to l_{jk})$ of the occurrences of this association in
such a class is defined as follows:

$$o(l_{pq} \to l_{jk}) = \sum_{n=1}^{N'} \min(\mu_{F_{pq}}, \mu_{F_{jk}}) \qquad (2)$$

where $N'$ is the total number of genes in this class that have this
association.
To decide whether the fuzzy association pattern $l_{pq} \to l_{jk}$
is interesting, we determine whether the difference between the
observed total degree $o(l_{pq} \to l_{jk})$ and the expected total degree
$e(l_{pq} \to l_{jk})$ is statistically significant. To determine if this is
the case, we can use the standardized residual [28], [29] to scale
the difference as follows:

$$z_{pj} = \frac{o(l_{pq} \to l_{jk}) - e(l_{pq} \to l_{jk})}{\sqrt{e(l_{pq} \to l_{jk})}} \qquad (3)$$

where

$$e(l_{pq} \to l_{jk}) = \frac{\sum_{n=1}^{N'} \mu_{F_{pq}}}{\sum_{n=1}^{N'} \sum_{q=1}^{s_p} \mu_{F_{pq}}} \cdot \frac{\sum_{n=1}^{N'} \mu_{F_{jk}}}{\sum_{n=1}^{N'} \sum_{k=1}^{s_j} \mu_{F_{jk}}} \cdot \sum_{n=1}^{N'} \sum_{q=1}^{s_p} \sum_{k=1}^{s_j} \min(\mu_{F_{pq}}, \mu_{F_{jk}}). \qquad (4)$$
As this statistic approximates the standard normal distribution
only when the asymptotic variance of $z_{pj}$ is close to one, it is,
in practice, adjusted by its variance for a more precise analysis.
The new test statistic, called the adjusted residual, can
be expressed as follows:

$$d_{pj} = \frac{z_{pj}}{\sqrt{v_{pj}}} \qquad (5)$$

where $v_{pj}$ is the maximum-likelihood estimate of its asymptotic
variance and is defined as

$$v_{pj} = \left(1 - \frac{\sum_{n=1}^{N'} \mu_{F_{pq}}}{\sum_{n=1}^{N'} \sum_{q=1}^{s_p} \mu_{F_{pq}}}\right)\left(1 - \frac{\sum_{n=1}^{N'} \mu_{F_{jk}}}{\sum_{n=1}^{N'} \sum_{k=1}^{s_j} \mu_{F_{jk}}}\right). \qquad (6)$$
This statistic $d_{pj}$ has an approximate standard normal
distribution [30], [31], and the fuzzy association pattern $l_{pq} \to l_{jk}$ is
interesting when the test statistic is statistically significant. In
other words, if $d_{pj} > 1.96$ in (5), we can conclude, with a
confidence level of 95%, that the fuzzy association pattern $l_{pq} \to l_{jk}$
in this class is interesting. It should be noted that the test statistic
can always be updated as new gene expression data are
collected. In other words, IFM can discover new patterns
incrementally without having to retrain on the whole dataset.
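A minimal sketch of the interestingness test in (2)-(6) follows. The data layout is an assumption: each class is represented by two matrices of membership degrees, one per attribute, with one row per gene and one column per linguistic term.

```python
import numpy as np

def adjusted_residual(mu_p, mu_j, q, k):
    """Adjusted residual d_pj for the pattern l_pq -> l_jk (eqs. 2-6).

    mu_p: (N' x s_p) membership degrees of a class's genes to the
          linguistic terms of attribute E_p; mu_j: likewise for E_j.
    q, k: column indices of the two linguistic terms in the pattern.
    """
    observed = np.minimum(mu_p[:, q], mu_j[:, k]).sum()           # eq. (2)
    total = sum(np.minimum(mu_p[:, a], mu_j[:, b]).sum()
                for a in range(mu_p.shape[1])
                for b in range(mu_j.shape[1]))
    prop_p = mu_p[:, q].sum() / mu_p.sum()                        # marginal of l_pq
    prop_j = mu_j[:, k].sum() / mu_j.sum()                        # marginal of l_jk
    expected = prop_p * prop_j * total                            # eq. (4)
    z = (observed - expected) / np.sqrt(expected)                 # eq. (3)
    v = (1.0 - prop_p) * (1.0 - prop_j)                           # eq. (6)
    return z / np.sqrt(v)                                         # eq. (5)

# A pattern is taken as interesting at the 95% level when d_pj > 1.96.
```

The incremental property follows from the fact that every quantity above is a running sum over genes, so newly collected genes only add terms to the sums.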
C. Step 3: Weight Assignment
Since the fuzzy association patterns discovered in Step 2 are
not completely deterministic, the uncertainty associated with
$l_{pq} \to l_{jk}$ can be modeled with the confidence measure defined
as $\Pr(l_{jk} \mid l_{pq})$. For the purpose of making use of $l_{pq} \to l_{jk}$
for classification, we use a weight-of-evidence measure [32],
$W(l_{pq} \to l_{jk})$, which is defined in terms of the mutual information
$I(l_{jk} : l_{pq})$ as follows:

$$W(l_{pq} \to l_{jk}) = W(l_{jk}/\bar{l}_{jk} \mid l_{pq}) = I(l_{jk} : l_{pq}) - I(\bar{l}_{jk} : l_{pq}) \qquad (7)$$

where

$$I(l_{jk} : l_{pq}) = \log \frac{\Pr(l_{jk} \mid l_{pq})}{\Pr(l_{jk})} = \log \frac{\sum_{n=1}^{N'} \min(\mu_{F_{pq}}, \mu_{F_{jk}})}{\sum_{n=1}^{N'} \sum_{q=1}^{s_p} \min(\mu_{F_{pq}}, \mu_{F_{jk}})} \qquad (8)$$

and

$$I(\bar{l}_{jk} : l_{pq}) = \log \frac{\Pr(\bar{l}_{jk} \mid l_{pq})}{\Pr(\bar{l}_{jk})} = \log \frac{\sum_{n=1}^{N'} \sum_{k'=1, k' \neq k}^{s_j} \min(\mu_{F_{pq}}, \mu_{F_{jk'}})}{\sum_{n=1}^{N'} \sum_{k'=1, k' \neq k}^{s_j} \sum_{q=1}^{s_p} \min(\mu_{F_{pq}}, \mu_{F_{jk'}})}. \qquad (9)$$
The term $\Pr(l_{jk} \mid l_{pq})$ can be considered as the probability
of the pattern $l_{pq} \to l_{jk}$ being observed in a class, and
$\Pr(\bar{l}_{jk} \mid l_{pq})$ as the probability that a pattern $l_{pq} \to l_{jk'}$,
where $k' \neq k$, is observed in a class. $W(l_{pq} \to l_{jk})$ measures
the amount of positive or negative evidence provided by
$l_{pq}$ supporting or refuting the observation of $l_{jk}$ together with it.
Since this measure is probabilistic, it can work effectively even
when the data being dealt with contain incomplete, missing, or
erroneous values. It should be noted that $W(l_{pq} \to l_{jk})$ may
not equal $W(l_{jk} \to l_{pq})$; in this analysis, we take the one with
the larger weight of evidence.
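The weight-of-evidence computation in (7)-(9) can be sketched as follows, again assuming (as an illustrative layout, not one prescribed by the paper) that a class's fuzzified data are given as membership matrices with one row per gene and one column per linguistic term. The `eps` smoothing term is an added safeguard against log(0) on sparse data.

```python
import numpy as np

def weight_of_evidence(mu_p, mu_j, q, k, eps=1e-12):
    """Weight of evidence W(l_pq -> l_jk) via mutual information (eqs. 7-9).

    mu_p: (N' x s_p) membership degrees of a class's genes to the terms
    of attribute E_p; mu_j: likewise for E_j. q, k: term column indices.
    """
    def co(a, b):
        # total co-occurrence degree of term a of E_p with term b of E_j
        return np.minimum(mu_p[:, a], mu_j[:, b]).sum()

    s_p, s_j = mu_p.shape[1], mu_j.shape[1]
    # I(l_jk : l_pq), eq. (8): co-occurrence of l_pq with l_jk, relative
    # to the co-occurrence of all terms of E_p with l_jk
    i_pos = np.log((co(q, k) + eps) /
                   (sum(co(a, k) for a in range(s_p)) + eps))
    # I(~l_jk : l_pq), eq. (9): the same ratio over the terms k' != k of E_j
    num = sum(co(q, b) for b in range(s_j) if b != k)
    den = sum(co(a, b) for a in range(s_p) for b in range(s_j) if b != k)
    i_neg = np.log((num + eps) / (den + eps))
    return i_pos - i_neg                                          # eq. (7)
```

A large positive value means observing $l_{pq}$ is strong evidence for $l_{jk}$; a large negative value means it is strong evidence against it.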
D. Step 4: Gene Function Prediction
Given a set of gene expression data collected from a set of
$N''$ genes from previously unseen gene expression data (i.e.,
gene expression data that are not in the original database), this
set of $N''$ genes can be represented by $G'' = \{G_{(1)}, \ldots, G_{(i)}, \ldots, G_{(N'')}\}$,
where the genes in $G''$ are not in $G$. To predict the class
membership of $G_{(i)}$ in $G''$, the fuzzy association patterns previously
discovered in each class can be searched to see which
patterns match the expression profile of $G_{(i)}$. If an
association pattern $l_{(p)q} \to l_{(j)k}$ discovered earlier in a particular
class matches $G_{(i)}$, then we can conclude that there
is some evidence supporting that $G_{(i)}$ belongs to this class. The
weight of evidence $W'(l_{(p)q} \to l_{(j)k})$ supporting the assignment
of $G_{(i)}$ to a class can then be defined as follows [30]:

$$W'(l_{(p)q} \to l_{(j)k}) = W(l_{pq} \to l_{jk}) \cdot \mu_{F_{(p)q}}. \qquad (10)$$

Then, we can combine all the evidence provided by the fuzzy
association patterns (in a set $\Omega$) that support the assignment
of $G_{(i)}$ to a class by computing a total weight-of-evidence
measure as follows:

$$TW'(l_{(p)q} \to l_{(j)k}) = \sum_{p=1}^{|\Omega|} W'(l_{(p)q} \to l_{(j)k}). \qquad (11)$$

Finally, the degree of membership with which $G_{(i)}$ belongs to each
class can be calculated using (11), and $G_{(i)}$ can then be
assigned to different classes with different degrees of membership.
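The prediction step in (10) and (11) can then be sketched as accumulating, per class, the membership-weighted evidence of every matching pattern. The dictionary layouts for the unseen gene and the discovered patterns are illustrative assumptions:

```python
def classify(gene_memberships, class_patterns):
    """Rank classes by total weight of evidence for one gene (eqs. 10-11).

    gene_memberships: dict mapping (attribute, term) -> membership degree
        of the unseen gene, e.g. {("E1", "H"): 0.8, ("E2", "L"): 0.6}.
    class_patterns: dict mapping class name -> list of
        ((attr_p, term_q), (attr_j, term_k), weight) triples from Steps 2-3.
    Both layouts are illustrative assumptions, not the paper's notation.
    """
    totals = {}
    for cls, patterns in class_patterns.items():
        tw = 0.0
        for antecedent, consequent, w in patterns:
            mu_ant = gene_memberships.get(antecedent, 0.0)
            mu_cons = gene_memberships.get(consequent, 0.0)
            if mu_ant > 0.0 and mu_cons > 0.0:    # the pattern matches G_(i)
                tw += w * mu_ant                   # eq. (10): W' = W * mu
        totals[cls] = tw                           # eq. (11): sum over matches
    return sorted(totals.items(), key=lambda t: t[1], reverse=True)
```

Because the totals are kept per class rather than normalized into a single winner, a gene can end up with substantial evidence for several classes at once, which is how the multiple-membership behavior described above arises.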
III. EXPERIMENTAL RESULTS
For performance evaluation, two different experiments involving
real gene expression data were performed. The first
experiment evaluates the prediction performance of the proposed
IFM compared to traditional classifiers. In the
second experiment, we used IFM with other clustering algorithms
commonly used for gene function prediction to see
whether it can improve their performance. The details and
results of each of these experiments are discussed in this section.
A. Experiment One
In this experiment, we used a genome-wide expression dataset
that covers the entire yeast genome (6221 genes) under 80
different experimental conditions and can be downloaded from
Eisen's Lab (http://rana.lbl.gov/EisenData.htm). To annotate the
function of each gene in this dataset, we used the well-known
Munich Information Center for Protein Sequences (MIPS)
functional catalogue database [33]. The dataset covers 52 MIPS
functional classes. In our study, we used the top-N accuracy
as a measure to evaluate performance [5]. For each algorithm,
the measure calculates the likelihood of each functional class
$C_p$, where $p = 1, \ldots, P$ and $P$ is the total number of classes,
that each gene $G_i$ belongs to. As a result, each classification
algorithm produces a ranked ordering (in descending order) of the
likelihood of all functional classes that each gene belongs to.
The algorithm is, therefore, considered to have made a correct
prediction if any of the top-N functional classes that the gene
was classified into is actually a function of the gene [6]. Here, we
set N = 3 (top-3 accuracy), as a yeast gene typically has around
three different functions [33]. In our experiments, the overall
accuracy was computed based on a tenfold cross-validation
approach [5]. For each trial, 90% of the genes in each class
were randomly selected for training and the remaining 10%
were used for testing. After the ten experiments corresponding to
the ten folds were performed, the average accuracy was then
computed.
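The top-N accuracy measure used here can be sketched as follows; the per-gene ranked prediction lists and annotation sets are an assumed data layout:

```python
def top_n_accuracy(ranked_classes, true_classes, n=3):
    """Fraction of genes for which any of the top-n predicted classes
    is among the gene's annotated functional classes.

    ranked_classes: one list per gene of class labels sorted by
        descending predicted likelihood.
    true_classes: one set per gene of the gene's actual (e.g. MIPS)
        functional classes. Both layouts are assumptions.
    """
    hits = sum(1 for ranks, truth in zip(ranked_classes, true_classes)
               if any(c in truth for c in ranks[:n]))
    return hits / len(true_classes)
```

With n = 3, as in the experiments, a prediction counts as correct if any of the three highest-ranked classes is a true annotation of the gene.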
For performance evaluation, we compared the performance
of IFM with k-NN and SVM (Gist package:
http://bioinformatics.ubc.ca/gist/), as discussed in Section I.
For k-NN, we set k = 12, as this setting gave us the best
result, and the N most common labels among the k nearest neighbors
were then assigned to the unclassified gene. For SVM, we used the
popular linear kernel, as suggested in [16]. To handle multiclassification
problems, SVM requires that a classifier be developed
for each functional class. Here, we used the one-versus-the-rest
method [34] to decompose the problem into a number of binary
classification problems before training the SVM. Similar to k-NN,
TABLE I
COMPARISON OF DIFFERENT CLASSIFICATION ALGORITHMS (TEN LARGEST FUNCTIONAL CLASSES, PREDICTION ACCURACY)
TABLE II
COMPARISON OF DIFFERENT CLASSIFICATION ALGORITHMS (GENOME-WIDE FUNCTIONAL PREDICTION, 52 CLASSES, PREDICTION ACCURACY)
TABLE III
COMPARISON OF THE PROPOSED TECHNIQUE (IFM) USING DIFFERENT PERCENTAGES OF ATTRIBUTES FOR TRAINING (TEN LARGEST FUNCTIONAL CLASSES, PREDICTION ACCURACY)
the N most common labels among the classifiers were assigned
to the unclassified gene.
Tables I and II show the comparisons of the performance
of the different algorithms for the ten largest functional classes
and for genome-wide functional prediction (52 classes), respectively.
According to these tables, we found that IFM outperforms
the other, traditional classification algorithms. For those classes that
have the highest average prediction accuracies (over 80%) (i.e.,
transport routes, cell cycle, RNA synthesis, and carbohydrate
metabolism), we found that the genes in these classes (on average,
over 480 genes) typically have 2-3 biological functions (in
general, over 92% of the genes in the yeast genome have three
or more functions), and IFM can successfully predict most of
these functions. This indicates that IFM is able to take into
account the heterogeneity present within each functional class,
and therefore, more accurate performance can be obtained.
In order to show that the discovered patterns can be continually
refined based only on newly collected data, without the need
for retraining on the whole dataset, additional experiments
were performed using the same dataset, and the results obtained
are shown in Tables III and IV. In these experiments, we first
used 30% of the attributes to train IFM, and then an additional
10% of the attributes was added and trained on in each trial.
TABLE IV
COMPARISON OF THE PROPOSED TECHNIQUE (IFM) USING DIFFERENT PERCENTAGES OF ATTRIBUTES FOR TRAINING (GENOME-WIDE FUNCTIONAL PREDICTION, 52 CLASSES, PREDICTION ACCURACY)
TABLE V
COMPARISON OF CLUSTERING ALGORITHMS (z-SCORE)
Hence, there are eight trials in total (i.e., T1: 30%, T2: 40%, T3:
50%, T4: 60%, T5: 70%, T6: 80%, T7: 90%, and T8: 100%).
According to these tables, we found that IFM can continually
improve the prediction accuracies based only on the newly
added attributes.
B. Experiment Two
The dataset we used in this experiment contains a set of 517
genes [35] whose expression levels vary in response to serum
concentration in human fibroblasts. In this dataset, ten functional
classes (clusters) were reported [36]. For performance evaluation,
we used the z-score measure proposed in [37], which is
based on the mutual information between a clustering result and
gene annotation data. A higher z-score indicates a clustering
result that deviates further from random.
For experimentation, we used the hierarchical agglomerative
clustering algorithm, the k-means algorithm, SOM, and the
fuzzy c-means algorithm (as discussed in Section I), respectively,
to discover initial clusters (classes) in this dataset (see
Table V). After the initial clustering process, we then applied
IFM to each clustering result obtained (except the results
obtained using the fuzzy c-means algorithm, as it is already able to
discover genes with multiple functions), so that genes that had
already been assigned to the initial clusters were reevaluated
to determine whether they should remain in the same cluster,
move to a different cluster, or be assigned to more than one (see
Table VI). According to these tables, we found that applying
IFM to the different clustering algorithms improves their
performance. With IFM, the k-means algorithm and SOM can also
outperform the fuzzy c-means algorithm. In addition, we
performed additional experiments to show the incremental feature
of IFM, as we did in the first experiment. According to
Tables VII-IX, we found that IFM can still improve the performance
of existing clustering algorithms continually, based only
on the newly added attributes. Also, the results demonstrated
that the proposed incremental data mining technique is able to
TABLE VI
COMPARISON OF EACH CLUSTERING ALGORITHM WITH THE PROPOSED TECHNIQUE (IFM) (z-SCORE)
TABLE VII
COMPARISON OF THE HIERARCHICAL CLUSTERING ALGORITHM WITH THE PROPOSED TECHNIQUE (IFM) USING DIFFERENT PERCENTAGES OF ATTRIBUTES FOR TRAINING (z-SCORE)
TABLE VIII
COMPARISON OF k-MEANS WITH THE PROPOSED TECHNIQUE (IFM) USING DIFFERENT PERCENTAGES OF ATTRIBUTES FOR TRAINING (z-SCORE)
TABLE IX
COMPARISON OF SOM WITH THE PROPOSED TECHNIQUE (IFM) USING DIFFERENT PERCENTAGES OF ATTRIBUTES FOR TRAINING (z-SCORE)
work well on other datasets for other organisms that are not as
well annotated as yeast.
IV. CONCLUSION
In this paper, we propose a fuzzy data mining technique called
IFM for the discovery of fuzzy association patterns in gene
expression data for gene function prediction problems. IFM
is able to perform its task effectively by taking into account
the heterogeneity in gene expression data based on concepts from
fuzzy set theory. It transforms continuous gene expression data
into linguistic values that express such data in qualitative
descriptions such as "highly expressed", "lowly expressed", etc.
Using a fuzzy measure, it can discover, for each class or cluster of
gene function groups, the linguistic gene expression values that
are relevant for the characterization of each functional class.
The proposed IFM technique is unique in the sense that the
discovered patterns can be continually refined based only on
newly collected data, without the need for retraining on the
whole dataset. In other words, the prediction accuracy of IFM
can be improved through an iterative process based only on
newly collected data. For performance evaluation, IFM has been
tested with different sets of gene expression data. Experimental
results show that it can be very effective: it can attain very high
function prediction accuracy and can also be used with existing
clustering algorithms to improve their performance.
REFERENCES
[1] M. Schena, D. Shalon, R. W. Davis, and P. O. Brown, "Quantitative monitoring of gene expression patterns with a complementary DNA microarray," Science, vol. 270, no. 5235, pp. 467-470, 1995.
[2] D. Lockhart, H. Dong, M. C. Byrne, M. T. Follettie, M. V. Gallo, M. S. Chee, M. Mittmann, C. Wang, M. Kobayashi, H. Horton, and E. L. Brown, "Expression monitoring by hybridization to high-density oligonucleotide arrays," Nat. Biotechnol., vol. 14, pp. 1675-1680, 1996.
[3] D. J. Lockhart and E. A. Winzeler, "Genomics, gene expression and DNA arrays," Nature, vol. 405, no. 6788, pp. 827-836, 2000.
[4] A. Zhang, Advanced Analysis of Gene Expression Microarray Data. Singapore: World Scientific, 2006.
[5] J. Han and M. Kamber, Data Mining: Concepts and Techniques, 2nd ed. San Mateo, CA: Morgan Kaufmann, 2006.
[6] D. P. Berrar, W. Dubitzky, and M. Granzow, A Practical Approach to Microarray Data Analysis. Norwell, MA: Kluwer, 2003.
[7] M. B. Eisen, P. T. Spellman, P. O. Brown, and D. Botstein, "Cluster analysis and display of genome-wide expression patterns," Proc. Nat. Acad. Sci. USA, vol. 95, no. 25, pp. 14863-14868, 1998.
[8] S. Tavazoie, J. D. Hughes, M. J. Campbell, R. J. Cho, and G. M. Church, "Systematic determination of genetic network architecture," Nat. Genet., vol. 22, no. 3, pp. 281-285, 1999.
[9] P. Tamayo, D. Slonim, J. Mesirov, Q. Zhu, S. Kitareewan, E. Dmitrovsky, E. S. Lander, and T. R. Golub, "Interpreting patterns of gene expression with self-organizing maps: Methods and application to hematopoietic differentiation," Proc. Nat. Acad. Sci. USA, vol. 96, no. 6, pp. 2907-2912, 1999.
[10] L. Tari, C. Baral, and S. Kim, "Fuzzy c-means clustering with prior biological knowledge," J. Biomed. Inf., vol. 42, no. 1, pp. 74-81, 2009.
[11] J. Cheng, M. Cline, J. Martin, D. Finkelstein, T. Awad, D. Kulp, and M. A. Siani-Rose, "A knowledge-based clustering algorithm driven by gene ontology," J. Biopharm. Stat., vol. 14, no. 3, pp. 687-700, 2004.
[12] L. Jinze and W. Wang, "A framework for ontology-driven subspace clustering," in Proc. 10th ACM SIGKDD Int. Conf. Knowl. Discov. Data Mining, 2004, pp. 623-628.
[13] Z. Fang, J. Yang, Y. Li, Q. Luo, and L. Liu, "Knowledge guided analysis of microarray data," J. Biomed. Inf., vol. 39, no. 4, pp. 401-411, 2006.
[14] D. Huang and W. Pan, "Incorporating biological knowledge into distance-based clustering analysis of microarray gene expression data," Bioinformatics, vol. 22, no. 10, pp. 1259-1268, 2006.
[15] Y. Lu and J. Han, "Cancer classification using gene expression data," Inf. Syst., vol. 28, no. 4, pp. 243-268, 2003.
[16] B. Scholkopf, K. Tsuda, and J. P. Vert, "Support vector machine applications in computational biology," in Kernel Methods in Computational Biology. Cambridge, MA: MIT Press, 2004, pp. 71-92.
[17] D. L. Hartl and E. W. Jones, Genetics: Analysis of Genes and Genomes, 6th ed. Sudbury, MA: Jones & Bartlett, 2005.
[18] Y. Tu, G. Stolovitzky, and U. Klein, "Quantitative noise analysis for gene expression microarray experiments," Proc. Nat. Acad. Sci. USA, vol. 99, no. 22, pp. 14031-14036, 2002.
[19] N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge, U.K.: Cambridge Univ. Press, 2002.
[20] V. N. Vapnik, Statistical Learning Theory. Berlin, Germany: Springer-Verlag, 1998.
[21] L. A. Zadeh, "Fuzzy sets," Inf. Control, vol. 8, pp. 338-353, 1965.
[22] L. A. Zadeh, "Fuzzy logic and approximate reasoning," Synthese, vol. 30, pp. 407-428, 1975.
[23] L. A. Zadeh, "A theory of approximate reasoning," Mach. Intell., vol. 9, pp. 149-194, 1979.
[24] P. J. Woolf and Y. Wang, "A fuzzy logic approach to analyzing gene expression data," Physiol. Genomics, vol. 3, no. 1, pp. 9-15, 2000.
[25] H. Ressom, R. Reynolds, and R. S. Varghese, "Increasing the efficiency of fuzzy logic-based gene expression data analysis," Physiol. Genomics, vol. 13, no. 2, pp. 107-117, 2003.
[26] Y. Tang, Y. Q. Zhang, Z. Huang, X. Hu, and Y. Zhao, "Recursive fuzzy granulation for gene subsets extraction and cancer classification," IEEE Trans. Inf. Technol. Biomed., vol. 12, no. 6, pp. 723-730, Nov. 2008.
[27] S. Mitra and T. Acharya, Data Mining: Multimedia, Soft Computing, and Bioinformatics. New York: Wiley, 2003.
[28] W. J. Ewens and G. R. Grant, Statistical Methods in Bioinformatics. Berlin, Germany: Springer-Verlag, 2005.
[29] S. J. Haberman, "The analysis of residuals in cross-classified tables," Biometrics, vol. 29, pp. 205-220, 1973.
[30] K. C. C. Chan and A. K. C. Wong, "A statistical technique for extracting classificatory knowledge from databases," in Knowledge Discovery in Databases. Cambridge, MA: AAAI/MIT Press, 1991, pp. 107-123.
[31] K. C. C. Chan, A. K. C. Wong, and D. K. Y. Chiu, "Learning sequential patterns for probabilistic inductive prediction," IEEE Trans. Syst., Man, Cybern., vol. 24, no. 10, pp. 1532-1547, Oct. 1994.
[32] D. B. Osteyee and I. J. Good, Information, Weight of Evidence, the Singularity Between Probability Measures and Signal Detection. Berlin, Germany: Springer-Verlag, 1974.
[33] H. W. Mewes, D. Frishman, U. Guldener, G. Mannhaupt, K. Mayer, M. Mokrejs, B. Morgenstern, M. Munsterkotter, S. Rudd, and B. Weil, "MIPS: A database for genomes and protein sequences," Nucleic Acids Res., vol. 30, pp. 31-34, 2002.
[34] E. Allwein, R. Schapire, and Y. Singer, "Reducing multiclass to binary: A unifying approach for margin classifiers," in Proc. Int. Conf. Mach. Learning, 2000, pp. 9-16.
[35] V. R. Iyer, M. B. Eisen, D. T. Ross, G. Schuler, T. Moore, J. C. F. Lee, J. M. Trent, L. M. Staudt, J. Hudson, M. S. Boguski, D. Lashkari, D. Shalon, D. Botstein, and P. O. Brown, "The transcriptional program in the response of human fibroblasts to serum," Science, vol. 283, pp. 83-87, 1999.
[36] R. Shamir, A. Maron-Katz, A. Tanay, C. Linhart, I. Steinfeld, R. Sharan, Y. Shiloh, and R. Elkon, "EXPANDER: An integrative program suite for microarray data analysis," BMC Bioinf., vol. 6, art. no. 232, 2005.
[37] F. D. Gibbons and F. P. Roth, "Judging the quality of gene expression-based clustering methods using gene annotation," Genome Res., vol. 12, no. 10, pp. 1574-1581, 2002.
Patrick C. H. Ma received the B.A. and Ph.D. degrees in computer science from The Hong Kong Polytechnic University, Hong Kong, China.
His research interests include bioinformatics and computational biology, data mining, computational intelligence, and business intelligence.

Keith C. C. Chan received the B.Math. degree in computer science and statistics, and the M.A.Sc. and Ph.D. degrees in systems design engineering from the University of Waterloo, Waterloo, ON, Canada.
He was at the International Business Machines Corporation (IBM) Canada Laboratory, Toronto, ON, Canada, where he was involved in the development of software engineering tools. In 1993, he joined the Department of Electrical and Computer Engineering, Ryerson University, Toronto, as an Associate Professor. In 1994, he joined the Department of Computing, The Hong Kong Polytechnic University, Hong Kong, China, where he is currently a Professor. He is also a Guest Professor of the Graduate School and an Adjunct Professor of the Institute of Software, Chinese Academy of Sciences, Beijing, China. He was a consultant to government agencies and various companies in Hong Kong, China, Singapore, Malaysia, and Canada. His research interests include data mining, computational intelligence, bioinformatics, and software engineering.