
International Journal on Recent and Innovation Trends in Computing and Communication ISSN: 2321-8169

Volume: 6 Issue: 8 24 - 28
Diagnosis and Prognosis of Breast Cancer Using Multi Classification

R. Shyamala Prof. R. Maruthi

M.C.A, (MPhil) MCA, M.Phil, Ph.D
Research Scholar, Professor,
Dept. Of Computer Science Department of Computer Science,
Prist University, Tanjore Prist University
Email id: Email
Ph: 7338788962

Abstract: Data mining is the process of analysing data from different viewpoints and condensing it into useful information. Data mining offers several families of algorithms, including classification, regression, segmentation, association, and sequence analysis algorithms. A classification algorithm partitions a given data set and predicts one or more discrete variables based on the other attributes in the data set. The ID3 (Iterative Dichotomiser 3) algorithm starts with the original data set S as the root node. For every unused attribute of S it calculates the entropy H(S) (or the information gain IG(A)) and selects the attribute with the smallest entropy (or largest information gain). A genetic algorithm (GA) is a heuristic search that imitates the process of natural selection. Using the GA operators of selection, crossover, and mutation, a genetic algorithm can select the cancer-related data from a given data set. An earlier method (KNN+GA) was not successful for breast cancer and primary tumor. Our new algorithm, GA+ID3, identifies the breast cancer data set within the given data set. This paper applies the combined classification algorithm to the diagnosis and prognosis of breast cancer.

Keywords: Data mining, Classification algorithm, Genetic algorithm, Decision tree (ID3), Medical data set.


I. Introduction

Data mining is the computational process of discovering patterns in large data sets. It is a method at the intersection of artificial intelligence, machine learning, statistics, and database systems. The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use. Data mining is the analysis step of the "knowledge discovery in databases" process, or KDD. Data mining involves six common classes of tasks, such as anomaly detection, association rule mining, clustering, classification, and regression. Classification is one of the most widely used techniques for large medical data sets. In this work, data mining techniques are combined to create a novel method for the diagnosis and prognosis of breast cancer for a particular patient. The genetic-based ID3 algorithm is a very simple algorithm with which the diagnosis and prognosis of cancer can easily be performed on a given data set. Decision tree classifiers do not require any domain knowledge or parameter setting. They can handle multidimensional data and are simple and fast. There are many decision tree algorithms, such as CART, ID3, and C4.5.

Breast cancer is considered a major health problem in men and women. In India, breast cancer cases in men and women are increasing in number. A new global study estimates that by 2030 the number of breast cancer cases in India will increase from 120,000 to around 200,000 per year. Cancer is a type of disease which causes the cells of the body to change their characteristics and grow abnormally. Early detection of breast cancer is essential in reducing loss of life.

A prognosis is an estimate of the likely course and outcome of a disease. The prognosis of a patient diagnosed with cancer is often viewed as the chance that the disease will be treated successfully and that the patient will recover. Prognostic statements are statements containing prognostic information. Prognostic factors are pieces of information associated with a specific outcome of disease, which can be utilized in the formulation of the prognosis.

This paper is structured as follows: Sections II and III review the concepts of the pre-processing method, the genetic algorithm, ID3, and breast cancer. Section IV describes the existing method. Section V explains our proposed method. Section VI discusses the results, and Section VII concludes.

II. Basic concepts

The pre-processing method uses data mining techniques to identify the target data within a large data set. Pre-processing comprises several tasks: data cleaning, data integration, data transformation, data reduction, and data discretization.
IJRITCC | August 2018, Available @
Data cleaning: fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies.
Data integration: combine multiple databases, data cubes, or files.
Data transformation: normalization and aggregation.
Data reduction: reduce the volume of data while producing the same or similar analytical results.
Data discretization: part of data reduction, replacing numerical attributes with nominal ones.

A genetic algorithm (GA) is a search heuristic that mimics the process of natural selection. This heuristic (also sometimes called a metaheuristic) is routinely used to generate useful solutions to optimization and search problems. Genetic algorithms belong to the larger class of evolutionary algorithms (EA), which generate solutions to optimization problems using techniques inspired by natural evolution, such as inheritance, mutation, selection, and crossover.

GA uses genetics as its model of problem solving. Each solution in a genetic algorithm is represented by a chromosome. Chromosomes are made up of genes. The collection of all chromosomes is called the population. Generally, three popular operators are used in GA.

1) Selection:

Selection is the stage of a genetic algorithm in which individual genomes are chosen from a population for later breeding (using the crossover operator). Selection may be implemented by any of the following methods:

• Fitness proportionate selection: the individual is selected on the basis of fitness. The probability of an individual being selected increases with its fitness relative to the fitness of its competitors.
• Boltzmann selection
• Tournament selection
• Rank selection
• Steady state selection
• Truncation selection
• Local selection

2) Crossover:

Crossover is a genetic operator used to vary the programming of a chromosome or chromosomes from one generation to the next. Crossover is the process of taking more than one parent solution and producing a child solution from them. There are several methods for selecting the chromosomes.

Various types of crossover operators are:
1) Uniform crossover
2) Cycle crossover
3) Partially mapped crossover
4) Uniform partially mapped crossover
5) Non-wrapping ordered crossover
6) Ordered crossover
7) Crossover with reduced surrogate
8) Shuffle crossover

3) Mutation:

Mutation is a genetic operator used to maintain genetic diversity from one generation of a population of genetic algorithm chromosomes to the next. It is analogous to biological mutation. Mutation alters one or more gene values in a chromosome from its initial state. In mutation, the solution may change entirely from the previous solution. Hence GA can come to a better solution by using mutation.

Fitness value:

A fitness function is a particular type of objective function that is used to summarise, as a single value, how good a candidate solution is. Each design solution is commonly represented as a string of numbers.
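The three operators described above can be illustrated with a minimal Python sketch. It is not the implementation used in this paper. It assumes that each chromosome is a list of binary genes, uses tournament selection and uniform crossover as representative choices from the lists above, and uses bit-flip mutation as one common mutation scheme. The function names, the mutation rate, and the number of generations are hypothetical.

```python
import random


def tournament_select(population, fitness, k=3, rng=random):
    """Tournament selection: sample k individuals and return the fittest one."""
    contenders = rng.sample(population, k)
    return max(contenders, key=fitness)


def uniform_crossover(parent_a, parent_b, rng=random):
    """Uniform crossover: copy each gene from either parent with equal probability."""
    return [a if rng.random() < 0.5 else b for a, b in zip(parent_a, parent_b)]


def bit_flip_mutation(chromosome, rate=0.05, rng=random):
    """Mutation: flip each binary gene independently with probability `rate`."""
    return [1 - gene if rng.random() < rate else gene for gene in chromosome]


def evolve(population, fitness, generations=20, rng=random):
    """Breed a new population of the same size in every generation."""
    for _ in range(generations):
        population = [
            bit_flip_mutation(
                uniform_crossover(
                    tournament_select(population, fitness, rng=rng),
                    tournament_select(population, fitness, rng=rng),
                    rng=rng,
                ),
                rng=rng,
            )
            for _ in range(len(population))
        ]
    # Return the fittest chromosome of the final generation.
    return max(population, key=fitness)
```

For example, with a toy fitness function such as `sum` (the number of genes set to 1), `evolve(population, sum)` returns the fittest chromosome of the last generation. In an attribute selection setting, each gene could instead mark whether an attribute of the medical data set is kept, with a fitness function chosen to reward useful attribute subsets.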


Fig: Working of the genetic algorithm. Flow: chromosomes of the large medical data set → selection of the target data set → using genetic operators → related breast cancer data set.

III. Decision tree algorithm

A decision tree is a decision support tool that uses a tree-like graph or model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility. Decision tree learning uses a decision tree as a predictive model which maps observations about an item to conclusions about the item's target value. Decision trees are of two types: 1) classification trees and 2) regression trees. Tree models where the target variable can take a finite set of values are called classification trees. Decision trees where the target variable can take continuous values (typically real numbers) are called regression trees.

There are many specific decision-tree algorithms:

• ID3 (Iterative Dichotomiser 3)
• C4.5 (successor of ID3)
• CART (Classification And Regression Tree)
• CHAID (CHI-squared Automatic Interaction Detector): performs multi-level splits when computing classification trees.
• MARS: extends decision trees to handle numerical data better.

ID3 (Iterative Dichotomiser 3) is an algorithm invented by Ross Quinlan used to generate a decision tree from a dataset. ID3 is the precursor to the C4.5 algorithm, and is typically used in the machine learning and natural language processing domains. The ID3 algorithm begins with the original set as the root node. On each iteration of the algorithm, it iterates through every unused attribute of the set and calculates the entropy H(S) (or information gain IG(A)) of that attribute. It then selects the attribute which has the smallest entropy (or largest information gain) value. The set is then split by the selected attribute to produce subsets of the data.

Information gain

Information gain is used by the ID3 tree-generation algorithm. It is based on the concept of entropy from information theory.

Breast cancer

Breast cancer is a malignant tumor that starts in the cells of the breast. A malignant tumor is a group of cancer cells that can grow into (invade) surrounding tissues or spread (metastasize) to distant areas of the body.

Symptoms of breast cancer (female):
The following symptoms are commonly recorded in breast cancer medical data sets:

1. Breast changes
2. Bloating
3. Between-period bleeding
4. Skin changes
5. Blood in your pee or stool
6. Changes in lymph nodes
7. Trouble swallowing
8. Weight loss without trying
9. Heartburn
10. Mouth changes
11. Fever
12. Fatigue
13. Cough
14. Pain
15. Belly pain and depression

IV. Existing method

The existing approach was tested on 7 data sets: 6 medical data sets and 1 non-medical data set. Six of the data sets were chosen from the UCI Repository, and the heart disease A.P data set was taken from various corporate hospitals in A.P. For heart disease, accuracy increased by 5% using GA with the full training data set and by 15% under cross validation, compared with KNN without GA. KNN combined with the genetic algorithm was not successful for breast cancer and primary tumor.

V. Proposed method

Our proposed approach combines GA and decision tree (ID3) to improve the classification accuracy on the breast cancer data set. The genetic algorithm is applied to the large data set collected from a medical centre, after a pre-processing method has identified the related data set. Using the GA operators (selection, crossover, mutation), we obtain the common attributes of the medical data set. The decision tree algorithm is then applied to the results of the genetic algorithm to identify the cancer data set. The cancer data set classified by the combined GA+ID3 is used for the prognosis and diagnosis of breast cancer.

Genetic based ID3 classification algorithm:

Step 1: Load the medical data set.

Step 2: Apply the pre-processing method to the data set and identify the related data set.
Step 3: Apply the GA operators to the related attributes of the medical data set.
Step 4: Obtain the common data set produced by the GA operators.
Step 5: Apply ID3 to the results of the GA operators.
Step 6: Classify the data set with the combined GA+ID3 to obtain the cancer data set.
Step 7: Use the classified cancer data set for the diagnosis and prognosis of breast cancer.
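
The ID3 step of this procedure relies on the entropy and information gain criterion described in Section III. A minimal Python sketch of that criterion follows, using the standard definitions H(S) = -Σ p_c log2 p_c over the class proportions p_c in S, and IG(A) = H(S) - Σ_v (|S_v| / |S|) H(S_v) over the values v of attribute A. It is an illustration under assumptions, not the implementation evaluated in this paper: records are assumed to be dictionaries of nominal attribute values, and the function names and the toy records in the usage note are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy H(S) of a list of class labels, in bits."""
    total = len(labels)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )


def information_gain(rows, labels, attribute):
    """IG(A): entropy of S minus the weighted entropy of the subsets split by A."""
    total = len(labels)
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[attribute], []).append(label)
    remainder = sum(len(s) / total * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder


def best_attribute(rows, labels, attributes):
    """Return the unused attribute with the largest information gain."""
    return max(attributes, key=lambda a: information_gain(rows, labels, a))
```

As a usage illustration with made-up records, suppose four records have the attributes `breast_changes` and `fatigue`, and the class label agrees exactly with `breast_changes`. Then `best_attribute` returns `breast_changes`, because splitting on it leaves subsets with zero entropy. A full ID3 implementation would apply this choice recursively to each subset.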

Accuracy of the classifier is computed as

\[ \text{Accuracy} = \frac{\text{number of samples correctly classified in the test data}}{\text{total number of samples in the test data}} \]
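
The accuracy measure can be computed directly from the predicted and actual class labels of the test data. The short Python sketch below assumes both are given as equal-length lists. The function name and the error handling are illustrative choices, not part of the original method.

```python
def accuracy(predicted, actual):
    """Fraction of test samples whose predicted label equals the actual label."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("test data must not be empty")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)
```

For instance, if three of four test samples are classified correctly, the function returns 0.75.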

VI. Results and discussion

The performance of our proposed method has been tested on 10 medical data sets and 2 non-medical data sets. The accuracy level increased on various data sets with the genetic algorithm. The existing method using KNN with GA did not increase the accuracy level for breast cancer. Our new algorithm, GA with ID3, increased the accuracy level. This algorithm can be applied to all types of cancer data sets.

VII. Conclusion

In this paper we have presented the classification of breast cancer using GA with the ID3 algorithm. Our proposed method improves the accuracy level on the given medical data set. Experimental results on 10 data sets show that our approach is a competitive method for classification. The proposed method identifies the cancer data set and supports the diagnosis and prognosis of breast cancer.

References

[1] E. S. Samundeeswari (2015), "Computational techniques in breast cancer diagnosis and prognosis: A review", International Journal of Advanced Research.
[2] K. Arutchelvan, R. Periyasamy (2015), "Cancer prediction system using data mining techniques", International Research Journal of Engineering and Technology.
[3] Hamid Karim Khani Zand (2015), "A comparative survey on data mining techniques for breast cancer diagnosis and prediction", Indian Journal of Fundamental and Applied Life Sciences.
[4] Jaimini Majali, Rishikesh Niranjan, Vinamara Phatak, Omkar Tadakhe (2015), "Data mining techniques for diagnosis and prognosis of cancer", International Journal of Advanced Research in Computer and Communication.
[5] Jahanvi Joshi, Rinal Doshi, Jigar Patel, "Diagnosis and prognosis breast cancer using classification rules", International Journal of Engineering Research and General Science, volume 2.
[6] M. Akhil Jabbar, B. L. Deekshatulu, Priti Chandra (2013), "Classification of heart disease using K-nearest neighbor and genetic algorithm", International Conference on Computational Intelligence: Modeling Techniques and Applications, Elsevier.
[7] T. Velmurugan (2014), "A survey on breast cancer analysis using data mining techniques", IEEE International Conference on Computational Intelligence and Computing Research.
