Abstract: This paper proposes the definition of breast cancer diagnostic typologies by the grade-of-membership approach. This fuzzy clustering model is described theoretically, and a fixed-point algorithm is used in its estimation. An application to breast cancer diagnostic classification shows the existence of two distinct patterns. The graphical representation of the grade-of-membership estimates confirms the good fuzzy properties of the two-cluster solution.
1, ..., J with $L_j$ categories ($L_j \geq 2$). Thus, $y_{ij}$ indicates the category of $Y_j$ for individual $i$, with $1 \leq y_{ij} \leq L_j$. Associated with each individual response there are $L_j$ binary variables $Y_{ijl}$ ($i = 1, \ldots, n$; $j = 1, \ldots, J$; $l = 1, \ldots, L_j$), where $y_{ijl} = I(y_{ij} = l)$, i.e.

\[
y_{ijl} =
\begin{cases}
1, & y_{ij} = l \\
0, & y_{ij} \neq l
\end{cases},
\]

with $\sum_{l=1}^{L_j} y_{ijl} = 1$ and $\sum_{j=1}^{J} \sum_{l=1}^{L_j} y_{ijl} = J$.

The fuzzy clustering Grade-of-Membership (GoM) model [12, 5] is a fuzzy-set classification approach that identifies the number, say $K$, of profiles or pure types that best describe the pattern of association between the categories of variables. The cluster or group of subject $i$ is indicated by $C_i \in \{1, \ldots, K\}$, $i = 1, \ldots, n$.

The GoM model [5] is defined by two sets of parameters:

1. the first one relates the $K$ pure types and the $J$ variables

\[
\lambda_{kjl} = \Pr(Y_{ijl} = 1 \mid C_i = k),
\]

i.e. it is the probability that individual $i$ in pure type $k$ has the response $l$ to the variable $j$;

2. the second set of coefficients ($g_{ik}$) relates each observation to the $K$ pure types, i.e., they describe how close the individual profile $i$ is to each of the $K$ pure types, with $g_{ik} \geq 0$ and $\sum_{k=1}^{K} g_{ik} = 1$.

Thus, the GoM model parameterizes

\[
\pi_{ijl} = \sum_{k=1}^{K} g_{ik} \lambda_{kjl}. \qquad (1)
\]

The GoM model assumes local independence, i.e., the variables are independent within each cluster [6], hence

\[
f_i(y_i; \lambda, g_i) = \prod_{j=1}^{J} f_j(y_{ij}; \lambda_j, g_i),
\]

where $\lambda_j = \{\lambda_{1j1}, \ldots, \lambda_{KjL_j}\}$, $\lambda = \{\lambda_1, \ldots, \lambda_J\}$, and $g_i = \{g_{i1}, \ldots, g_{iK}\}$.

Assuming that $y_{ij}$ follows a multinomial distribution

\[
y_{ij} \sim \mathrm{Multi}_{L_j}(1, \pi_{ij1}, \ldots, \pi_{ijL_j})
\]

with density $f_j(y_{ij}; \lambda_j, g_i) = \prod_{l=1}^{L_j} (\pi_{ijl})^{y_{ijl}}$, we have

\[
f_i(y_i; \lambda, g_i) = \prod_{j=1}^{J} \prod_{l=1}^{L_j} (\pi_{ijl})^{y_{ijl}}.
\]

From equation (1), the definition of the Grade-of-Membership model for a given observation $i$ is

\[
f_i(y_i; \lambda, g_i) = \prod_{j=1}^{J} \prod_{l=1}^{L_j} \left( \sum_{k=1}^{K} g_{ik} \lambda_{kjl} \right)^{y_{ijl}}.
\]

There are $nK + K \sum_{j=1}^{J} L_j$ parameters to estimate; the corresponding number of free parameters is $p_K = n(K - 1) + K \sum_{j=1}^{J} (L_j - 1)$, due to the constraints $\lambda_{kjL_j} = 1 - \sum_{l=1}^{L_j - 1} \lambda_{kjl}$ and $g_{iK} = 1 - \sum_{k=1}^{K-1} g_{ik}$.

Under the assumption that $y_1, \ldots, y_n$ are independent realizations of the feature vector $y$, the likelihood function for $\varphi$ is given by $L(\varphi; y) = \prod_{i=1}^{n} f_i(y_i; \lambda, g_i)$, where $\varphi = \{\lambda, g\}$ represents the full set of model parameters.
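As an illustration of the model density and the parameter count given above, the following minimal Python sketch evaluates the GoM log-likelihood and the number of free parameters on a toy example. It is not the authors' MATLAB implementation. The function names `gom_loglik` and `n_free_parameters`, the 0-based category coding, and all numerical values are assumptions made only for this illustration.

```python
import math


def gom_loglik(y, g, lam):
    """GoM log-likelihood: sum_i sum_j log( sum_k g[i][k] * lam[k][j][y[i][j]] ).

    y[i][j]   : observed category of variable j for individual i
                (0-based here, an assumption of this sketch)
    g[i][k]   : grade of membership of individual i in pure type k (rows sum to 1)
    lam[k][j] : list of L_j category probabilities for pure type k and variable j
    """
    total = 0.0
    for y_i, g_i in zip(y, g):
        for j, cat in enumerate(y_i):
            # pi_ijl from equation (1), evaluated at the observed category only,
            # because y_ijl is 1 for that category and 0 otherwise
            pi_ijl = sum(g_ik * lam[k][j][cat] for k, g_ik in enumerate(g_i))
            total += math.log(pi_ijl)
    return total


def n_free_parameters(n, K, levels):
    """Free parameters p_K = n(K - 1) + K * sum_j (L_j - 1)."""
    return n * (K - 1) + K * sum(L - 1 for L in levels)


# Hypothetical toy setting: n = 2 individuals, J = 2 variables, K = 2 pure types
levels = [2, 3]                      # L_1 = 2, L_2 = 3
lam = [
    [[0.9, 0.1], [0.7, 0.2, 0.1]],   # pure type 1
    [[0.2, 0.8], [0.1, 0.3, 0.6]],   # pure type 2
]
g = [[1.0, 0.0], [0.5, 0.5]]         # individual 1 is a pure type, individual 2 is mixed
y = [[0, 0], [1, 2]]

loglik = gom_loglik(y, g, lam)
p_K = n_free_parameters(n=len(y), K=2, levels=levels)
print(loglik, p_K)
```

In this toy case the first individual belongs entirely to pure type 1, so its contribution reduces to the pure-type probabilities, while the second individual's response probabilities are equal-weight averages of the two profiles.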
1. Set $K$ and the tolerance level ($\varepsilon$), the number of pure types and stop tolerance level, respectively; set $m \leftarrow 1$; initialize $g_{ik}^{(0)}$ from a random sample:

\[
g_{ik}^{(0)} \sim \mathrm{Uniform}[0, 1], \qquad
g_{ik}^{(0)} \leftarrow \frac{g_{ik}^{(0)}}{\sum_{h=1}^{K} g_{ih}^{(0)}},
\]

and

\[
\lambda_{kjl}^{(0)} = \frac{1}{L_j};
\]

2. Compute

\[
g_{ik}^{(m)} = \frac{1}{J} \sum_{j=1}^{J} \sum_{l=1}^{L_j} y_{ijl}
\frac{g_{ik}^{(m-1)} \lambda_{kjl}^{(m-1)}}{\sum_{h=1}^{K} g_{ih}^{(m-1)} \lambda_{hjl}^{(m-1)}}
\]

3. Compute

\[
\lambda_{kjl}^{(m)} =
\frac{\displaystyle \sum_{i=1}^{n} y_{ijl} \frac{g_{ik}^{(m)} \lambda_{kjl}^{(m-1)}}{\sum_{h=1}^{K} g_{ih}^{(m)} \lambda_{hjl}^{(m-1)}}}
{\displaystyle \sum_{i=1}^{n} \sum_{l'=1}^{L_j} y_{ijl'} \frac{g_{ik}^{(m)} \lambda_{kjl'}^{(m-1)}}{\sum_{h=1}^{K} g_{ih}^{(m)} \lambda_{hjl'}^{(m-1)}}}
\]

4. Compute

\[
\ell_K(\varphi^{(m)}; y) = \sum_{i=1}^{n} \sum_{j=1}^{J} \sum_{l=1}^{L_j} y_{ijl} \log \left( \sum_{k=1}^{K} g_{ik}^{(m)} \lambda_{kjl}^{(m)} \right)
\]

If $\ell_K(\varphi^{(m)}; y) - \ell_K(\varphi^{(m-1)}; y) < \varepsilon$, stop; otherwise $m \leftarrow m + 1$ and go to step 2.

Since this algorithm (like any iterative algorithm) cannot ensure convergence to the global maximum for a given value of $K$, it should be repeated using different starting values. This algorithm is implemented in MATLAB 7 [7]. We ran the algorithm 100 times with random initial solutions. As a stopping rule we set $\varepsilon = 10^{-7}$.

3 Application

We apply the GoM model to the Ljubljana Breast Cancer Data Set with 277 instances of real patient data. The data set is available from the UCI repository (http://archive.ics.uci.edu/ml/datasets/Breast+Cancer) and contains the following variables (see categories in Table 1):

1. Age: age (in years at last birthday) of the patient at the time of diagnosis;

2. Menopause: whether the patient is pre- or postmenopausal at time of diagnosis;

3. Tumor size: the greatest diameter (in mm) of the excised tumor;

4. Inv-nodes: the number (range 0 - 39) of axillary lymph nodes that contain metastatic breast cancer visible on histological examination;

5. Node caps: if the cancer does metastasise to a lymph node, although outside the original site of the tumor it may remain contained by the capsule of the lymph node. However, over time, and with more aggressive disease, the tumor may replace the lymph node and then penetrate the capsule, allowing it to invade the surrounding tissues;

6. Degree of malignancy: the histological grade (range 1-3) of the tumor. Grade 1 tumors predominantly consist of cells that, while neoplastic, retain many of their usual characteristics. Grade 3 tumors predominantly consist of cells that are highly abnormal;

7. Breast: breast cancer may obviously occur in either breast;

8. Breast quadrant: the breast may be divided into four quadrants, using the nipple as a central point;

9. Irradiation: radiation therapy is a treatment that uses high-energy x-rays to destroy cancer cells.

Table 1 provides the estimates of $\lambda_{kjl}$ for $K = 2$ and the observed sample frequencies for each category. In fact, the GoM model provides a clear split or clustering of the sample into two groups with different profiles: class 1 - Early cancer stage and class 2 - Advanced cancer stage. The estimates of $\lambda_{kjl}$ suggest that pure type 1 contains older women aged 50 and above, in opposition to pure type 2, which is characterized by women aged up to 49. This result suggests that younger women are less aware of the disease and that most of the time their problem is diagnosed at an advanced stage of its development, in opposition to the older women. This result is in agreement with the second variable: given that a woman is pure type 2, the probability of being premenopausal is 1.00.

In the early stage (cluster 1) the tumor size tends to be smaller, in contrast with cluster 2 in which tumor