
Proceedings of the 2nd WSEAS International Conference on Multivariate Analysis and its Application in Science and Engineering

Breast cancer diagnostic typologies by grade-of-membership fuzzy modeling

JOSE G. DIAS
ISCTE University Institute of Lisbon
Department of Quantitative Methods
Edifício ISCTE, Av. das Forças Armadas, 1649-026 Lisboa
PORTUGAL
jose.dias@iscte.pt

Abstract: This paper proposes the definition of breast cancer diagnostic typologies by the grade-of-membership approach. This fuzzy clustering model is described theoretically, and a fixed point algorithm is used in its estimation. An application to breast cancer diagnostic classification shows the existence of two distinct patterns. The graphical representation of the grade-of-membership estimates confirms the good fuzzy properties of the two-cluster solution.

Key-Words: Grade-of-membership model, clustering, fuzzy partition, breast cancer diagnostics

1 Introduction

Breast cancer is a leading cause of death for women worldwide. Indeed, it is the most common cause of cancer mortality, accounting for 16% of cancer deaths in adult women [11]. There is strong evidence that early detection through mammography screening and adequate treatment of women with a positive result could significantly reduce mortality from this life-threatening disease. This has propelled a considerable amount of research on breast cancer.

After diagnosis, the main treatment of breast cancer remains surgery followed by radiotherapy, with hormonal and chemotherapeutic agents often used to treat presumed micro-metastatic disease. Surgery removes any local tumor and allows a sample of the nodes to be taken and analyzed to describe the disease. The examination characterizes the features of the disease that make local recurrence and death more likely, such as the grade of the tumor (degree of abnormality displayed by the cells), the size of the tumor (maximum diameter, in mm), and the number of involved nodes [9].

Clustering is the search for homogeneous subsets in a data set. Clustering algorithms have been applied extensively in the context of Statistics (e.g., [3]) and Fuzzy Set Theory (e.g., [1]). For example, in marketing, market segmentation means the identification of groups of customers with similar behavior given a large database of customer data. On the other hand, in oncological studies, a database on breast cancer characteristics may allow the identification of different stages and patterns of the development of the disease.

An important aspect of therapy planning is to anticipate the risk of further disease so that the treatment can be adjusted. Thus, women at higher risk should be treated with stronger and more invasive treatment. This type of strategy reduces the side effects of unneeded invasive treatments and saves resources. Here we assume that there are two groups of women with heterogeneous needs. The goal is to identify the existing clustering structure in the data and to classify each woman into one of the groups given her profile. We apply the grade-of-membership model to identify the fuzzy clustering structure. This model has become very popular in health and related fields (see e.g. [8, 10]).

The remainder of this paper is organized as follows: Section 2 defines the conceptual fuzzy clustering framework; Section 3 provides a description of the data and its empirical analysis. The paper concludes with a summary of main findings, implications, and suggestions for further research.

2 The Grade-of-Membership model

Let us have a data set of $n$ objects to be clustered. An object is denoted by $i$ ($i = 1, \ldots, n$). Each object is characterized by $J$ categorical variables $Y_j$, $j =$

ISSN: 1790-5117 129 ISBN: 978-960-474-083-3



$1, \ldots, J$, with $L_j$ categories ($L_j \geq 2$). Thus, $y_{ij}$ indicates the category of $Y_j$ for individual $i$, with $1 \leq y_{ij} \leq L_j$. Associated with each individual response there are $L_j$ binary variables $Y_{ijl}$ ($i = 1, \ldots, n$; $j = 1, \ldots, J$; $l = 1, \ldots, L_j$), where $y_{ijl} = I(y_{ij} = l)$, i.e.

$$y_{ijl} = \begin{cases} 1, & y_{ij} = l \\ 0, & y_{ij} \neq l \end{cases},$$

with $\sum_{l=1}^{L_j} y_{ijl} = 1$ and $\sum_{j=1}^{J} \sum_{l=1}^{L_j} y_{ijl} = J$.

The fuzzy clustering Grade-of-Membership (GoM) model [12, 5] is a fuzzy-set classification approach that identifies the number, say $K$, of profiles or pure types that best describe the pattern of association between the categories of variables. The cluster or group of subject $i$ is indicated by $C_i \in \{1, \ldots, K\}$, $i = 1, \ldots, n$.

The Grade-of-Membership (GoM) model [5] is defined by two sets of parameters:

1. the first one relates the $K$ pure types and the $J$ variables:

$$\lambda_{kjl} = \Pr(Y_{ijl} = 1 \mid C_i = k),$$

i.e. it is the probability that individual $i$ in pure type $k$ has the response $l$ to the variable $j$;

2. the second set of coefficients ($g_{ik}$) relates each observation to the $K$ pure types, i.e., they describe how close the individual profile $i$ is to each of the $K$ pure types, with $g_{ik} \geq 0$ and $\sum_{k=1}^{K} g_{ik} = 1$.

Thus, the GoM model parameterizes $\pi_{ijl} = \Pr(Y_{ijl} = 1)$ as

$$\pi_{ijl} = \sum_{k=1}^{K} g_{ik} \Pr(Y_{ijl} = 1 \mid C_i = k) = \sum_{k=1}^{K} g_{ik} \lambda_{kjl}, \tag{1}$$

where $\pi_{ijl}$ represents the (unconditional) probability that individual $i$ has level $l$ on variable $j$, and the coefficients $g_{ik}$ are mixing weights indicating the degree to which a given individual is represented by each of the $K$ classes. These parameters and their constraints determine the geometry of the space, in which each observation is represented as a convex combination of $K$ coordinates. These coordinates define the extreme points of the space and are referred to as pure types or extreme points. In a clustering approach, they can also be interpreted as cluster or segment centroids [4].

The GoM model assumes local independence, i.e., the variables are independent within each cluster [6], hence

$$f(y_i; \lambda, g_i) = \prod_{j=1}^{J} f_j(y_{ij}; \lambda_j, g_i),$$

where $\lambda_j = \{\lambda_{1j1}, \ldots, \lambda_{KjL_j}\}$, $\lambda = \{\lambda_1, \ldots, \lambda_J\}$, and $g_i = \{g_{i1}, \ldots, g_{iK}\}$.

Assuming that $y_{ij}$ follows a multinomial distribution

$$y_{ij} \sim \mathrm{Multi}_{L_j}(1, \pi_{ij1}, \ldots, \pi_{ijL_j})$$

with density $f_{ij}(y_{ij}; \lambda_j, g_i) = \prod_{l=1}^{L_j} (\pi_{ijl})^{y_{ijl}}$, we have

$$f_i(y_i; \lambda, g_i) = \prod_{j=1}^{J} \prod_{l=1}^{L_j} (\pi_{ijl})^{y_{ijl}}.$$

From equation (1), the definition of the Grade-of-Membership model for a given observation $i$ is

$$f_i(y_i; \lambda, g_i) = \prod_{j=1}^{J} \prod_{l=1}^{L_j} \left( \sum_{k=1}^{K} g_{ik} \lambda_{kjl} \right)^{y_{ijl}}.$$

There are $nK + K \sum_{j=1}^{J} L_j$ parameters to estimate; the corresponding number of free parameters is $p_K = n(K - 1) + K \sum_{j=1}^{J} (L_j - 1)$ due to the constraints $\lambda_{kjL_j} = 1 - \sum_{l=1}^{L_j - 1} \lambda_{kjl}$ and $g_{iK} = 1 - \sum_{k=1}^{K-1} g_{ik}$.

Under the assumption that $y_1, \ldots, y_n$ are independent realizations of the feature vector $y$, the likelihood function for $\varphi$ is given by $L(\varphi; y) = \prod_{i=1}^{n} f_i(y_i; \lambda, g_i)$, where $\varphi = \{\lambda, g\}$ represents the GoM parameters, with $g = \{g_1, \ldots, g_n\}$. Thus, the likelihood function of the GoM model is

$$L(\varphi; y) = \prod_{i=1}^{n} \prod_{j=1}^{J} \prod_{l=1}^{L_j} \left( \sum_{k=1}^{K} g_{ik} \lambda_{kjl} \right)^{y_{ijl}},$$

and the log-likelihood function for $K$ pure types ($\ell_K(\varphi; y) = \log L(\varphi; y)$) is

$$\ell_K(\varphi; y) = \sum_{i=1}^{n} \sum_{j=1}^{J} \sum_{l=1}^{L_j} y_{ijl} \log \left( \sum_{k=1}^{K} g_{ik} \lambda_{kjl} \right). \tag{2}$$

The maximum likelihood estimate of $\varphi$ is provided by the score equation

$$\frac{\partial \ell_K(\varphi; y)}{\partial \varphi} = 0.$$

The GoM model is estimated by the fixed point algorithm defined as follows:




1. Set $K$ and the tolerance level ($\varepsilon$), the number of pure types and the stopping tolerance, respectively; set $m \leftarrow 1$; initialize $g_{ik}^{(0)}$ from a random sample:

$$g_{ik}^{(0)} \sim \mathrm{Uniform}[0, 1], \qquad g_{ik}^{(0)} \leftarrow \frac{g_{ik}^{(0)}}{\sum_{h=1}^{K} g_{ih}^{(0)}},$$

and

$$\lambda_{kjl}^{(0)} = \frac{1}{L_j};$$

2. Compute

$$g_{ik}^{(m)} = \frac{1}{J} \sum_{j=1}^{J} \sum_{l=1}^{L_j} y_{ijl} \frac{g_{ik}^{(m-1)} \lambda_{kjl}^{(m-1)}}{\sum_{h=1}^{K} g_{ih}^{(m-1)} \lambda_{hjl}^{(m-1)}};$$

3. Compute

$$\lambda_{kjl}^{(m)} = \frac{\displaystyle \sum_{i=1}^{n} y_{ijl} \frac{g_{ik}^{(m)} \lambda_{kjl}^{(m-1)}}{\sum_{h=1}^{K} g_{ih}^{(m)} \lambda_{hjl}^{(m-1)}}}{\displaystyle \sum_{i=1}^{n} \sum_{l'=1}^{L_j} y_{ijl'} \frac{g_{ik}^{(m)} \lambda_{kjl'}^{(m-1)}}{\sum_{h=1}^{K} g_{ih}^{(m)} \lambda_{hjl'}^{(m-1)}}};$$

4. Compute

$$\ell_K(\varphi^{(m)}; y) = \sum_{i=1}^{n} \sum_{j=1}^{J} \sum_{l=1}^{L_j} y_{ijl} \log \left( \sum_{k=1}^{K} g_{ik}^{(m)} \lambda_{kjl}^{(m)} \right).$$

If $\ell_K(\varphi^{(m)}; y) - \ell_K(\varphi^{(m-1)}; y) < \varepsilon$, stop; otherwise set $m \leftarrow m + 1$ and go to step 2.

Since this algorithm (like any iterative algorithm) cannot ensure convergence to the global maximum for a given value of $K$, it should be repeated using different starting values. The algorithm is implemented in MATLAB 7 [7]. We run the algorithm 100 times with random initial solutions. As stopping rule we set $\varepsilon = 10^{-7}$.

3 Application

We apply the GoM model to the Ljubljana Breast Cancer Data Set with 277 instances of real patient data. The data set is available from the UCI repository (http://archive.ics.uci.edu/ml/datasets/Breast+Cancer) and contains the following variables (see categories in Table 1):

1. Age: age (in years at last birthday) of the patient at the time of diagnosis;

2. Menopause: whether the patient is pre- or post-menopausal at the time of diagnosis;

3. Tumor size: the greatest diameter (in mm) of the excised tumor;

4. Inv-nodes: the number (range 0 - 39) of axillary lymph nodes that contain metastatic breast cancer visible on histological examination;

5. Node caps: if the cancer does metastasise to a lymph node, although outside the original site of the tumor it may remain contained by the capsule of the lymph node. However, over time, and with more aggressive disease, the tumor may replace the lymph node and then penetrate the capsule, allowing it to invade the surrounding tissues;

6. Degree of malignancy: the histological grade (range 1-3) of the tumor. Tumors that are grade 1 predominantly consist of cells that, while neoplastic, retain many of their usual characteristics. Grade 3 tumors predominantly consist of cells that are highly abnormal;

7. Breast: breast cancer may occur in either breast;

8. Breast quadrant: the breast may be divided into four quadrants, using the nipple as a central point;

9. Irradiation: radiation therapy is a treatment that uses high-energy x-rays to destroy cancer cells.

Table 1 provides the estimates of $\lambda_{kjl}$ for $K = 2$ and the observed sample frequencies for each category. The GoM model provides a clear split or clustering of the sample into two groups with different profiles: class 1 - Early cancer stage and class 2 - Advanced cancer stage. The estimates of $\lambda_{kjl}$ suggest that pure type 1 contains older women aged 50 and above, as opposed to pure type 2, which is characterized by women aged up to 49. This result suggests that younger women are less aware of the disease and that most of the time the problem is diagnosed at an advanced stage of its development, in contrast to older women. This result is in agreement with the second variable: given that a woman is in pure type 2, the probability of being premenopausal is 1.00.

Table 1: GoM model estimates ($\hat{\lambda}_{kjl}$)

                    GoM Model
Variables      Pure type 1   Pure type 2   Frequency (%)
Age
  10-19           0.00          0.00           0.00
  20-29           0.00          0.01           0.36
  30-39           0.00          0.29          13.00
  40-49           0.00          0.71          32.13
  50-59           0.60          0.00          32.85
  60-69           0.36          0.00          19.86
  70-79           0.03          0.00           1.81
Menopause
  lt40            0.04          0.00           1.81
  ge40            0.96          0.00          44.40
  premeno         0.00          1.00          53.79
Tumor-size
  0-4             0.06          0.00           2.89
  5-9             0.03          0.00           1.44
  10-14           0.20          0.00          10.11
  15-19           0.17          0.04          10.47
  20-24           0.15          0.20          17.33
  25-29           0.13          0.24          18.41
  30-34           0.15          0.26          20.58
  35-39           0.00          0.14           6.86
  40-44           0.09          0.07           7.94
  45-49           0.00          0.02           1.08
  50-54           0.03          0.03           2.89
Inv-nodes
  0-2             1.00          0.52          75.45
  3-5             0.00          0.24          12.27
  6-8             0.00          0.12           6.14
  9-11            0.00          0.05           2.53
  12-14           0.00          0.02           1.08
  15-17           0.00          0.04           2.17
  18-20           0.00          0.00           0.00
  21-23           0.00          0.00           0.00
  24-26           0.00          0.01           0.36
Node-caps
  No              1.00          0.60          79.78
  Yes             0.00          0.40          20.22
Deg-malig
  Low             0.46          0.00          23.83
  Medium          0.33          0.61          46.57
  High            0.21          0.39          29.60
Breast
  Left            0.56          0.49          52.35
  Right           0.44          0.51          47.65
Breast-quad
  Left-up         0.36          0.32          33.94
  Left-low        0.37          0.39          38.27
  Right-up        0.08          0.16          11.91
  Right-low       0.04          0.13           8.30
  Central         0.15          0.00           7.58
Irradiation
  No              1.00          0.56          77.62
  Yes             0.00          0.44          22.38

In the early stage (cluster 1) the tumor size tends to be smaller, in contrast with cluster 2, in which tumor size is larger than 15 mm. Consistently, the number of axillary lymph nodes that contain metastatic breast cancer visible on histological examination (Inv-nodes) is at most 2 in cluster 1, in contrast to cluster 2. Node-caps indicates whether the cancer does metastasise to a lymph node. One concludes that in pure type 1 the answer is no, because of its early stage. Pure type 2 is more heterogeneous, as it contains all the yes cases and still some no answers.

The degree of malignancy in pure type 1 tends to be low, as opposed to pure type 2, in which it is at least medium. Note that the medium and high levels are about twice as likely in pure type 2 as in pure type 1.

For the variables associated with the location of the cancer cells (Breast and Breast-quad), the results tend to be heterogeneous in each breast and in each quadrant of each breast. Thus, these variables cannot discriminate between the two pure types.

Finally, irradiation shows that women in cluster 1 were not treated with high-energy x-rays. In the second pure type, a subgroup was treated with irradiation to destroy cancer cells; for the remaining 56%, the tumor was probably already so widespread that the treatment was not justified given the cost-benefit trade-off.

Now we focus our analysis on the grade-of-membership parameter estimates ($\hat{g}_{ik}$). Figure 1 depicts the grades of membership for the 277 women on a scatter plot. For $K = 2$, one has $g_{i1} + g_{i2} = 1$, and therefore the observations fall on the line $(g_{i1}, 1 - g_{i1})$. The figure also gives the frequency at each point. Overall, a very small proportion of observations occupies the interior of the interval. Indeed, most of the grade-of-membership values are extreme, $(0, 1)$ or $(1, 0)$, which shows a good level of separation of the two clusters.

Figure 1: Scatter plot representation of the grade-of-membership estimates

To quantify the level of separation of the two groups we use the entropy or fuzziness in the parameters $g_{ik}$. This approach is rather popular in the context of mixture models (see e.g. [2]). For GoM models, the relative entropy is given by

$$E_K = 1 + \frac{1}{n \log K} \sum_{i=1}^{n} \sum_{k=1}^{K} g_{ik} \log g_{ik}.$$

$E_K$ varies between 0 and 1. Assuming $0 \log 0 = 0$, in a hard partition with $g_{ik} \in \{0, 1\}$ one has $g_{ik} \log g_{ik} = 0$, and consequently $E_K = 1$. Complete fuzziness means $g_{ik} = 1/K$, so that $\sum_{k=1}^{K} g_{ik} \log g_{ik} = -\log K$ for every $i$, resulting in $E_K = 0$. In our case, $E_K = 0.59$ and consequently this two-cluster typology provides a good level of separation.

4 Conclusion

In this paper we focused on the Grade-of-Membership model as a breast cancer diagnostic tool. This type of computer-based system for identification of breast cancer patterns can be very useful in the diagnosis and management of the disease progression. After providing a short introduction on the importance of these tools in cancer research, we gave an overview of the Grade-of-Membership model and its estimation. We illustrated its performance in understanding the relation between variables collected in a cancer tumor diagnostic study. The two-cluster solution provides a good separation into the two groups of women set a priori. The pure types were well described by the variables that we would expect a priori (see Table 1). Future research can extend the model to breast cancer prognosis, predicting recurrence (no-recurrence-events versus recurrence-events) from this clustering setting.

References

[1] J.C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms, Plenum Press, New York 1981.
[2] J.G. Dias and J.K. Vermunt, Bootstrap methods for measuring classification uncertainty in latent class models. In A. Rizzi and M. Vichi (eds.), COMPSTAT2006. Proceedings in Computational Statistics, pp. 31-41, Physica/Springer-Verlag, Heidelberg 2006.
[3] J.G. Dias and M.J. Cortinhal, The SKM algorithm: A K-means algorithm for clustering sequential data, Advances in Artificial Intelligence - IBERAMIA 2008, Lecture Notes in Artificial Intelligence, pp. 173-182, Springer-Verlag, Berlin 2008.
[4] A. Maetzel, S.H. Johnson, M.A. Woodbury, C. Bombardier, Use of grade of membership analysis to profile the practice styles of individual physicians in the management of acute low back pain, Journal of Clinical Epidemiology, 53, 2000, pp. 195-205.
[5] K.G. Manton and M.A. Woodbury, A new procedure for analysis of medical classification, Methods of Information in Medicine, 21, 1982, pp. 210-220.
[6] K.G. Manton, M.A. Woodbury, E. Stallard, L.S. Corder, The use of grade-of-membership techniques to estimate regression relationships, Sociological Methodology, 22, 1992, pp. 321-381.
[7] MathWorks, MATLAB 7.0, The MathWorks, Natick, MA 2004.
[8] P. McNamee, A comparison of the grade of membership measure with alternative health indicators in explaining costs for older people, Health Economics, 13(4), 2004, pp. 379-395.
[9] M.A. Richards, I.E. Smith, and J.M. Dixon, Role of systemic treatment for primary operable breast cancer, BMJ, 309, 1994, pp. 1263-1366.
[10] S. Szadoczky, S. Rozsa, S. Patten, M. Arato, and J. Furedi, Lifetime patterns of depressive symptoms in the community and among primary care attenders: an application of grade of membership analysis, Journal of Affective Disorders, 77, 2003, pp. 31-39.
[11] WHO, World Health Statistics 2008, World Health Organization, Geneva, Switzerland 2008.
[12] M.A. Woodbury, J. Clive, Clinical pure types as a fuzzy partition, Journal of Cybernetics, 4, 1974, pp. 111-121.
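
As a supplementary illustration, the fixed-point algorithm of Section 2 and the relative entropy $E_K$ of Section 3 can be sketched in a few lines of pure Python. This is only a minimal sketch under stated assumptions, not the MATLAB implementation used in the paper: the function names `gom_fixed_point` and `relative_entropy`, the 0-based coding of categories, the `max_iter` cap, the small `1e-12` offset on the random start, and the guard for an empty pure type are all choices made here for illustration.

```python
import math
import random


def gom_fixed_point(y, n_levels, K, tol=1e-7, max_iter=1000, seed=0):
    """Fixed-point estimation of a GoM model (illustrative sketch).

    y        -- n records, each a list of J category codes (0-based, assumed)
    n_levels -- list with the number of categories L_j of each variable
    K        -- number of pure types
    Returns (g, lam, trace): g[i][k], lam[k][j][l], log-likelihood per iteration.
    """
    rng = random.Random(seed)
    n, J = len(y), len(n_levels)

    # Step 1: random memberships normalised to sum to one; uniform lambda.
    g = []
    for _ in range(n):
        row = [rng.random() + 1e-12 for _ in range(K)]
        total = sum(row)
        g.append([v / total for v in row])
    lam = [[[1.0 / L] * L for L in n_levels] for _ in range(K)]

    def weights(i, j):
        # g_ik * lam_kjl / sum_h g_ih * lam_hjl at the observed level of y[i][j]
        lev = y[i][j]
        w = [g[i][k] * lam[k][j][lev] for k in range(K)]
        total = sum(w)
        return [v / total for v in w]

    def loglik():
        # Equation (2): only the observed level of each variable contributes.
        return sum(math.log(sum(g[i][k] * lam[k][j][y[i][j]] for k in range(K)))
                   for i in range(n) for j in range(J))

    trace = [loglik()]
    for _ in range(max_iter):
        # Step 2: update the memberships g_ik from the previous g and lambda.
        new_g = []
        for i in range(n):
            acc = [0.0] * K
            for j in range(J):
                for k, w in enumerate(weights(i, j)):
                    acc[k] += w / J
            new_g.append(acc)
        g = new_g

        # Step 3: update lambda_kjl using the new g and the previous lambda.
        num = [[[0.0] * L for L in n_levels] for _ in range(K)]
        for i in range(n):
            for j in range(J):
                for k, w in enumerate(weights(i, j)):
                    num[k][j][y[i][j]] += w
        for k in range(K):
            for j in range(J):
                total = sum(num[k][j])
                if total > 0.0:  # keep the old values if a pure type is empty
                    lam[k][j] = [v / total for v in num[k][j]]

        # Step 4: stop when the log-likelihood gain falls below tol.
        trace.append(loglik())
        if trace[-1] - trace[-2] < tol:
            break
    return g, lam, trace


def relative_entropy(g):
    """E_K = 1 + sum_i sum_k g_ik log g_ik / (n log K), with 0 log 0 = 0."""
    n, K = len(g), len(g[0])
    s = sum(v * math.log(v) for row in g for v in row if v > 0.0)
    return 1.0 + s / (n * math.log(K))
```

To mimic the multiple random starts described in Section 2, one could call `gom_fixed_point` with different `seed` values and keep the run whose final `trace[-1]` is largest; `relative_entropy(g)` then summarizes how crisp the resulting memberships are.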

