
Breast cancer detection through histopathology image classification using deep learning techniques

Abstract

Classification of breast cancer has been a topic of interest in healthcare and bioinformatics
because breast cancer is the second main cause of cancer-related deaths in women. Breast cancer
can be identified using a biopsy, in which tissue is removed and studied under a microscope.
The diagnosis depends on the qualification of the histopathologist, who looks for abnormal
cells. If the histopathologist is not well trained, this may lead to a wrong diagnosis. With
recent advances in image processing and machine learning, there is interest in developing
reliable pattern-recognition-based systems to improve the quality of diagnosis. The proposed
system identifies breast cancer by automatically classifying breast cancer histology images as
benign or malignant using convolutional neural networks. The experimental study shows that a
convolutional neural network achieves high classification accuracy.

Introduction

Breast cancer is the most common invasive cancer in women and the second main cause of
cancer death in women, after lung cancer. According to the International Agency for Research on
Cancer (IARC), which is part of the World Health Organization (WHO), cancer caused around
8.2 million deaths in 2012 alone. The number of new cases is expected to increase to more than
27 million by 2030.

Breast cancer can be diagnosed using medical imaging tests, such as histology and radiology
images. Analysis of radiology images can help to identify the areas where an abnormality is
located. However, it cannot determine whether the area is cancerous. A biopsy, in which
tissue is taken and studied under a microscope to see if cancer is present, is the only sure
way to identify whether an area is cancerous. After the biopsy, the diagnosis depends on the
qualification of the histopathologists, who examine the tissue under a microscope, looking for
abnormal or cancerous cells. Histology images allow us to distinguish the cell nuclei types and
their architecture according to a specific pattern.

Problem statement

Histopathologists visually examine the regularity of cell shapes and tissue distributions to
determine cancerous regions and the degree of malignancy. If the histopathologists are not well
trained, this may lead to an incorrect diagnosis. There is also a lack of specialists, which can
keep a tissue sample on hold for up to two months; this often occurs in Norway, for example.
Therefore, there is a pressing demand for computer-assisted diagnosis.

Objectives

The objective of the study is to develop an application that trains a convolutional neural
network and classifies breast cancer images as benign or malignant. The necessary image
pre-processing steps are carried out before the images are used to train the CNN architectures.
The main objective is to classify the images with high accuracy.

Methodologies

The methodology involves image pre-processing, such as RGB conversion and, if required,
segmentation, followed by feature extraction and training using a CNN algorithm. A test image
is then given to the trained model and classified as benign or malignant.

Expected results

During the training phase, the necessary output files, such as a pickle or h5 model file, are
created. The expected output of the proposed work is that a given test image is classified as
benign or malignant.

Literature survey

Computer aided diagnosis in digital pathology application: Review and perspective approach in
lung cancer classification

Abbas, Abbas K. & B. Sideseq, Fahad & Faeq Hussein, Ahmed & Basil, Mena. (2017),
219-224. 10.1109/NTICT.2017.7976109.

The authors reviewed important algorithms used in CAD applications for lung tissue
diagnostics and highlighted the performance of each. ROC characteristics were produced for
each selected algorithm (support vector machine (SVM), fuzzy C-means (FCM), convolutional
neural network (CNN) and CADFCM). The features of each algorithm and its performance in
computer-aided diagnosis (CAD) are discussed and explained. The authors also compared the work
of different research groups to highlight each criterion for the different algorithms and
approaches used in CAD platforms for lung cancer.

A Dataset for Breast Cancer Histopathological Image Classification

F. A. Spanhol, L. S. Oliveira, C. Petitjean and L. Heutte, IEEE Transactions on Biomedical
Engineering, vol. 63, no. 7, pp. 1455-1462, July 2016.

The authors used a dataset of 7909 breast cancer histopathology images acquired from 82
patients. The dataset includes both benign and malignant images. The task associated with this
dataset is the automated classification of these images into two classes, which would be a
valuable computer-aided diagnosis tool for the clinician. The authors achieved accuracy ranging
from 80% to 85%. By providing this dataset and a standardized evaluation protocol to the
scientific community, they brought together researchers from both the medical and the machine
learning fields to advance toward this clinical application.

A Deep Feature Based Framework for Breast Masses Classification

Jiao, Zhicheng & Gao, Xinbo & Wang, Ying & Li, Jie. (2016), Neurocomputing. 197.
10.1016/j.neucom.2016.02.060.

The authors designed a deep-feature-based framework for the breast mass classification task. It
mainly contains a convolutional neural network (CNN) and a decision mechanism. By combining
intensity information with deep features automatically extracted by the trained CNN from the
original image, the proposed method could better simulate the diagnostic procedure followed by
doctors and achieved state-of-the-art performance. In this framework, doctors' global and local
impressions of mass images were represented by deep features extracted from two different
layers, called high-level and middle-level features. Meanwhile, the original images were regarded
as detailed descriptions of the breast mass. Classifiers based on these features were then used
in combination to predict the classes of test images, and the outcomes of the classifiers based
on different features were analyzed jointly to determine the types of test images.

Convolutional neural networks for mammography mass lesion classification

J. Arevalo, F. A. González, R. Ramos-Pollán, J. L. Oliveira and M. A. Guevara Lopez,
2015 37th Annual International Conference of the IEEE Engineering in Medicine and
Biology Society (EMBC), Milan, 2015, pp. 797-800.

The authors presented an evaluation of convolutional neural networks that learn features for
mammography mass lesions before feeding them to a classification stage. Experimental results
showed that this approach is a suitable strategy, improving on the state-of-the-art
representation from 79.9% to 86% in terms of area under the ROC curve.
Proposed Architecture, Techniques, Algorithm

SYSTEM ARCHITECTURE

[Figure: system architecture. Training path: histopathology dataset -> pre-process -> split
train set (X_Train, Y_Train) -> CNN algorithm -> trained model. Test path: test input ->
pre-process -> trained model -> classify as benign/malignant.]

Techniques and Algorithm

• Dataset collection

• Image pre-processing

• Training using Convolutional 2D neural network

• Recognition

Evaluation

The evaluation of the proposed system is based on its classification output, which labels each
image as benign or malignant.

Conclusion

A convolutional neural network was trained with varying parameters and tested on a dataset of
breast cancer images using the deep learning framework TensorFlow. With the help of deep
learning techniques and a convolutional neural architecture, we have extracted the features of
an image and classified the image as a benign or malignant tumor. It is observed that the
classification accuracy mainly depends on how the CNN extracts and learns features in its
different layers as the parameters vary. The proposed system performs well, but there is still
room for improvement.
REQUIREMENT ANALYSIS

Computer-aided learning is a rapidly growing, dynamic area of research in tumor and cancer
detection. Recent research in machine learning promises improved accuracy in tumor and cancer
detection. Here, computers are enabled to develop intelligence by learning. There are many
types of machine learning techniques that are used to classify data sets.
Functional Requirements

The proposed application should be able to identify whether the input image shows cancer or
not. Functional requirements define a function of a system and its components. A function is
described as a set of inputs, the behavior and the outputs.

Functionality

The application is developed in such a way that future enhancements can be implemented easily
and that it requires minimal maintenance. The software used is open source and easy to install.
The application should be easy to install and use.

Reliability

Reliability covers maturity, fault tolerance and recoverability. The system is expected to work
reliably for any number of user inputs and any size of training dataset.
Usability

The software system is easy to understand, learn and operate. The user can provide an input
image and have it classified as a cancer or normal image.

Safety

Safety concerns the issues associated with the system's integrity level. The computer system
being used is protected by a password.

Security
The application does not block any of the available ports through the Windows firewall. The web
camera port should be enabled automatically; otherwise the user must enable it every time.

Robustness

The application is developed in such a way that future enhancements can be implemented easily
and that it requires minimal maintenance. The software used is open source and easy to install.
The application should be easy to install and use.

Communications

The application is developed in such a way that communication between the user and the computer
is handled through a GUI.

Non Functional Requirements

Non-functional requirements determine the resources required, time interval, transaction
rates, throughput and everything that deals with the performance of the system.

Maintainability

The system is easy to maintain, as it does not require any special maintenance after download.
Updates are required only when the user is notified of them. Easy maintenance is one of the
features that make this proposal usable.
Portability

The software must be easy to transfer to another environment, including being easy to install.
It is easily portable, as it runs on a regular computer. The user can access the system from
the computer on which it was installed.

Performance

Detection takes little time once the input has arrived. Similarly, training time is short
because training is limited to a small number of epochs.

Accuracy
The proposed system is intended to classify images more accurately than existing models, so
that the disease can be detected reliably.

Software Analysis

The project primarily focuses on breast cancer detection. We implemented it with Python 3.
The required libraries must be installed before executing the project. We installed OpenCV
(cv2), Keras, TensorFlow, NumPy, etc.

Hardware Requirements

Processor : Any Processor above 500 MHz.

Ram : 4 GB

Hard Disk : 250 GB

Input device : Standard Keyboard and Mouse, Web Camera

Output device : High Resolution Monitor.

Software Specification

Operating System : Windows 7 or higher

Programming : Python 3.6 and related libraries


DESCRIPTIONS OF VARIOUS COMPONENTS OF THE SYSTEM

PROGRAMMING LANGUAGE - PYTHON

Python is an interpreted high-level programming language for general-purpose programming.
Created by Guido van Rossum and first released in 1991, Python has a design philosophy that
emphasizes code readability, notably using significant whitespace. It provides constructs that
enable clear programming on both small and large scales.

Python features a dynamic type system and automatic memory management. It supports multiple
programming paradigms, including object-oriented, imperative, functional and procedural, and
has a large and comprehensive standard library.

Python interpreters are available for many operating systems. CPython, the reference
implementation of Python, is open source software and has a community-based development
model, as do nearly all of its variant implementations. CPython is managed by the non-profit
Python Software Foundation.

Scikit-learn

Scikit-learn provides simple and efficient tools for data mining and data analysis. It is a
free software machine learning library for the Python programming language. It features various
classification, regression and clustering algorithms including support vector machines, random
forests, gradient boosting, k-means and DBSCAN, and is designed to interoperate with the
Python numerical and scientific libraries NumPy and SciPy.

Some popular groups of models provided by scikit-learn include:

Clustering: for grouping unlabeled data such as KMeans.

Cross Validation: for estimating the performance of supervised models on unseen data.

Datasets: for test datasets and for generating datasets with specific properties for investigating
model behavior.

Dimensionality Reduction: for reducing the number of attributes in data for summarization,
visualization and feature selection such as Principal component analysis.

Ensemble methods: for combining the predictions of multiple supervised models.

Feature extraction: for defining attributes in image and text data.

Feature selection: for identifying meaningful attributes from which to create supervised models.

Parameter Tuning: for getting the most out of supervised models.

Manifold Learning: for summarizing and depicting complex multi-dimensional data.

Supervised Models: a vast array not limited to generalized linear models, discriminant analysis,
naive Bayes, lazy methods, neural networks, support vector machines and decision trees.
PROPOSED METHODOLOGY

Data collection

The data collection process involves the selection of quality data for analysis. Here we used
the ICIAR2018_BACH_dataset from https://iciar2018-challenge.grand-challenge.org/Dataset/ for the
deep learning implementation. Four classes of the dataset are considered, with 100 images each:
Normal, Benign, in situ carcinoma and Invasive carcinoma. The job of a data analyst is to find
ways and sources of collecting relevant and comprehensive data, interpreting it, and analyzing
results with the help of statistical techniques.

Data preprocessing

The purpose of preprocessing is to convert raw data into a form that fits machine / deep learning.
Structured and clean data allows a data scientist to get more precise results from an applied
machine learning model. The technique includes data formatting, cleaning, and sampling for text
data. Gray color conversion, segmentation and size reduction are some of the pre-processing
techniques for image dataset.

Dataset splitting

A dataset used for machine learning should be partitioned into three subsets: training, test, and
validation sets.

Training set. A data scientist uses a training set to train a model and determine the optimal
parameters that it has to learn from the data.

Test set. A test set is needed to evaluate the trained model and its capability for
generalization, that is, the model’s ability to identify patterns in new, unseen data after
having been trained on the training data. It’s crucial to use different subsets for training and
testing to avoid model overfitting, which is the inability to generalize mentioned above.
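
The split described above can be sketched as follows. This is a minimal illustration rather than the project's actual code: the helper name stratified_split, the 80/20 ratio and the fixed seed are assumptions, and the labels are generated in place of the 400 real image labels. Splitting per class keeps the four classes equally represented in both subsets.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_indices, test_indices), splitting each class separately."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Stand-in labels: 100 images for each of the four classes.
class_names = ["Normal", "Benign", "InSitu", "Invasive"]
labels = [name for name in class_names for _ in range(100)]

train_idx, test_idx = stratified_split(labels, test_fraction=0.2)
print(len(train_idx), len(test_idx))  # 320 80
```

A validation subset could be carved out of the training indices in the same way by calling the helper a second time.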

Model training
After a data scientist has preprocessed the collected data and split it into training and test
sets, model training can proceed. This process entails “feeding” the algorithm with training
data. The algorithm processes the data and outputs a model that is able to find a target value
(attribute) in new data, that is, the answer you want to get with predictive analysis. The
purpose of model training is to develop a model.

Model evaluation and testing

The goal of this step is to develop the simplest model able to formulate a target value fast and
well enough. A data scientist can achieve this goal through model tuning. That’s the optimization
of model parameters to achieve an algorithm’s best performance.
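
One common way to judge whether tuning has helped is to compare accuracy and the confusion matrix on the test set. The sketch below assumes integer class labels 0 to 3 for the four classes; the label vectors are made-up illustrative values, not results from this project.

```python
import numpy as np


def confusion_matrix(y_true, y_pred, num_classes):
    """Rows are true classes, columns are predicted classes."""
    matrix = np.zeros((num_classes, num_classes), dtype=int)
    for true_label, predicted_label in zip(y_true, y_pred):
        matrix[true_label, predicted_label] += 1
    return matrix


# Illustrative labels only (0=Normal, 1=Benign, 2=in situ, 3=Invasive).
y_true = [0, 0, 1, 1, 2, 2, 3, 3]
y_pred = [0, 1, 1, 1, 2, 3, 3, 3]

cm = confusion_matrix(y_true, y_pred, num_classes=4)
accuracy = np.trace(cm) / cm.sum()  # correct predictions lie on the diagonal
print(cm)
print(accuracy)  # 0.75
```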
DECOMPOSING A SYSTEM

The proposed system is decomposed into the following sub-systems:

• Image pre-processing

• Training using convolutional neural networks

• Disease detection

IMAGE PRE-PROCESSING

Image pre-processing is carried out in steps such as color conversion and Gaussian blurring.
The color conversion function converts the input image from one color space to another; here we
used BGR2GRAY to convert the input image to a grayscale image. The next pre-processing step is
Gaussian blurring, which removes noise from the images and smooths them. For image
segmentation, an adaptive Gaussian threshold is applied, so that the threshold is calculated
for every small region of the image.

TRAINING USING CONVOLUTIONAL 2D NEURAL NETWORK

We used the two-dimensional convolutional (Conv2D) neural network layers available in Keras for
training and testing our model. The overall architecture of Conv2D is shown below.
Sequential Model

Models in Keras can come in two forms: Sequential and via the Functional API. For most deep
learning networks, the Sequential model is likely to be sufficient. It allows sequential layers
(and even recurrent layers) of the network to be stacked easily, in order from input to output.

The first line declares the model type as Sequential().

Adding 2D Convolutional layer

Add a 2D convolutional layer to process the 2D input images. The first argument passed to the
Conv2D() layer function is the number of output channels – in this case we have 32 output
channels. The next input is the kernel_size, which in this case we have chosen to be a 5×5
moving window, followed by the strides in the x and y directions (1, 1). Next, the activation
function is a rectified linear unit and finally we have to supply the model with the size of the
input to the layer. Declaring the input shape is only required of the first layer – Keras is good
enough to work out the size of the tensors flowing through the model from there.

Adding 2D max pooling layer

Add a 2D max pooling layer. We simply specify the size of the pooling in the x and y directions
– (2, 2) in this case, and the strides.

Adding another convolutional + max pooling layer

Next we add another convolutional + max pooling layer, with 64 output channels. The default
strides argument in the Conv2D() function is (1, 1) in Keras, so we can leave it out. Likewise,
the default strides argument of the max pooling layer in Keras is equal to the pool size, so it
can also be left out. For a 28 x 28 input image, the input tensor for this layer has 32
channels, the number of output channels from the previous layer; with Keras's default "valid"
padding, the preceding 5×5 convolution and 2×2 pooling reduce its spatial size to 12 x 12,
giving (batch_size, 12, 12, 32).
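
The tensor sizes flowing through these layers can be checked with the usual output-size formula, floor((size + 2 * padding - kernel) / stride) + 1. The sketch below assumes a 28 x 28 input and no padding (Keras's default "valid" mode); the helper name conv_output_size is illustrative.

```python
def conv_output_size(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling window along one axis."""
    return (size + 2 * padding - kernel) // stride + 1


size = 28
after_conv1 = conv_output_size(size, kernel=5, stride=1)         # 24
after_pool1 = conv_output_size(after_conv1, kernel=2, stride=2)  # 12
after_conv2 = conv_output_size(after_pool1, kernel=5, stride=1)  # 8
after_pool2 = conv_output_size(after_conv2, kernel=2, stride=2)  # 4

print(after_conv1, after_pool1, after_conv2, after_pool2)  # 24 12 8 4
print(after_pool2 * after_pool2 * 64)  # 1024 values enter the dense layers
```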

Flatten and adding dense layer

Next is to flatten the output from these to enter our fully connected layers. The next two lines
declare our fully connected layers – using the Dense() layer in Keras, we specify the size – in
line with our architecture, we specify 1000 nodes, each activated by a ReLU function. The
second is our soft-max classification, or output layer, which is the size of the number of our
classes.
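
Putting the layer descriptions together, the network might be declared as below. This is a sketch, not the project's original code: the 28 x 28 single-channel input shape is an assumption, and keras.Input is used here in place of passing input_shape to the first Conv2D layer.

```python
from tensorflow import keras
from tensorflow.keras import layers

NUM_CLASSES = 4            # Normal, Benign, in situ carcinoma, Invasive carcinoma
INPUT_SHAPE = (28, 28, 1)  # assumed: small grayscale image patches

model = keras.Sequential([
    keras.Input(shape=INPUT_SHAPE),
    # 32 output channels, 5x5 moving window, stride 1 in x and y, ReLU activation.
    layers.Conv2D(32, kernel_size=(5, 5), strides=(1, 1), activation="relu"),
    layers.MaxPooling2D(pool_size=(2, 2), strides=(2, 2)),
    # Second block: 64 output channels; default strides are left out.
    layers.Conv2D(64, kernel_size=(5, 5), activation="relu"),
    layers.MaxPooling2D(pool_size=(2, 2)),
    # Flatten, then the fully connected layers.
    layers.Flatten(),
    layers.Dense(1000, activation="relu"),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])

model.summary()
```

For full-resolution histology images the input shape, and therefore the size of the first dense layer, would be much larger, so some resizing or patch extraction is usually needed first.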

Training neural network

When compiling the model for training, we have to specify the loss function and tell the
framework what type of optimiser to use (e.g. gradient descent, the Adam optimiser, etc.).

We use the standard cross-entropy loss function for categorical classification
(keras.losses.categorical_crossentropy) and the Adam optimizer (keras.optimizers.Adam).
Finally, we can specify a metric that will be calculated when we run evaluate() on the model.
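
To make the loss concrete, categorical cross-entropy for one sample is the negative log of the probability assigned to the true class. The NumPy sketch below recomputes it by hand; the prediction vector is an invented example and the clipping constant is an assumption made to avoid log(0).

```python
import numpy as np


def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Per-sample cross-entropy between one-hot targets and predicted probabilities."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)  # guard against log(0)
    return -np.sum(y_true * np.log(y_pred), axis=-1)


# One sample whose true class is index 1 (Benign), as a one-hot vector.
y_true = np.array([[0.0, 1.0, 0.0, 0.0]])
# Hypothetical softmax output that assigns 0.7 to the true class.
y_pred = np.array([[0.1, 0.7, 0.1, 0.1]])

loss = categorical_crossentropy(y_true, y_pred)
print(loss)  # approximately [0.357], i.e. -ln(0.7)
```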

We first pass in all of our training data – in this case x_train and y_train. The next argument is
the batch size. In this case we are using a batch size of 32. Next we pass the number of training
epochs (2 in this case). The verbose flag, set to 1 here, specifies if you want detailed information
being printed in the console about the progress of the training.

RECOGNITION

Finally, we pass the validation or test data to the fit function so Keras knows what data to test the
metric against when evaluate() is run on the model. The trained model treats classification as a
multi-class problem: for each image it outputs one class, namely Normal, Benign, in situ
carcinoma or Invasive carcinoma.
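
The compile, fit, evaluate and recognition steps described above could look like the following sketch. Everything here is illustrative: the data are random arrays standing in for pre-processed images, the network is deliberately smaller than the one described so that it runs quickly, and the variable names are assumptions. Batch size 32, 2 epochs and verbose=1 follow the text.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

CLASS_NAMES = ["Normal", "Benign", "In situ carcinoma", "Invasive carcinoma"]

# Random stand-in data (NOT real histology images), only to make the sketch runnable.
rng = np.random.default_rng(0)
x_train = rng.random((64, 28, 28, 1), dtype=np.float32)
y_train = keras.utils.to_categorical(rng.integers(0, 4, size=64), num_classes=4)
x_test = rng.random((16, 28, 28, 1), dtype=np.float32)
y_test = keras.utils.to_categorical(rng.integers(0, 4, size=16), num_classes=4)

# Small stand-in network; the full architecture has more filters and a 1000-node layer.
model = keras.Sequential([
    keras.Input(shape=(28, 28, 1)),
    layers.Conv2D(8, kernel_size=(5, 5), activation="relu"),
    layers.MaxPooling2D(pool_size=(2, 2)),
    layers.Flatten(),
    layers.Dense(len(CLASS_NAMES), activation="softmax"),
])

model.compile(
    loss=keras.losses.categorical_crossentropy,
    optimizer=keras.optimizers.Adam(),
    metrics=["accuracy"],
)

# Training data first, then batch size, epochs, verbosity and the validation data.
history = model.fit(
    x_train, y_train,
    batch_size=32, epochs=2, verbose=1,
    validation_data=(x_test, y_test),
)

test_loss, test_accuracy = model.evaluate(x_test, y_test, verbose=0)

# Recognition: the class with the highest softmax probability is the output.
probabilities = model.predict(x_test[:1], verbose=0)
predicted_class = CLASS_NAMES[int(np.argmax(probabilities[0]))]
print(predicted_class)
```

With random inputs the reported accuracy is meaningless; the point is only the order of the calls and the shape of the data passed to each.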
UML DIAGRAM
USE CASE DIAGRAM

[Figure: use case diagram. Actor: User. Use cases: input image dataset, pre-process, apply CNN,
test images, pre-process, classify output, cancer prediction.]

Figure: Use case Diagram

The above figure represents the use case diagram of the proposed system: the user inputs a
dataset, the dataset is pre-processed, and the deep learning algorithm (a convolutional neural
network) is used to generate the trained model that predicts breast cancer. The input test set
is analyzed for cancer prediction as a binary problem. The actor and use cases are represented.
An ellipse represents each use case, namely input image, pre-process, apply DNN, prediction and
output.
SEQUENCE DIAGRAM

[Figure: sequence diagram. Lifelines: Dataset, Pre-process, Split X,Y Train, DNN, Trained model.
Messages in order: input, input, input, create, predict breast cancer.]

Figure: Sequence Diagram

A sequence diagram shows, as parallel vertical lines, different processes or objects that live
simultaneously and, as horizontal arrows, the messages exchanged between them, in the order in
which they occur. The above figure represents the sequence diagram of the proposed system and
shows its sequence of data flow.
ACTIVITY DIAGRAM

[Figure: activity diagram. Load dataset -> split dataset into train set and test set -> train
model -> breast cancer predict -> classify output.]

Figure: Activity diagram

The above figure represents the activity diagram of the proposed system. It shows the complete
flow of activity, from dataset loading through the full sequence of modules.
COLLABORATION DIAGRAM

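[Figure: collaboration diagram. Objects: Input Dataset, Preprocess, X,Y Train Split, DNN, Test
Input, Pre-process, Trained model, Breast cancer predict, classify output. Numbered messages as
labelled in the figure: 1:input, 2:steps, 3:Input, 4:Train, 5:input, 2:input, 5:recognize, 6:Get.]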

Figure: Collaboration Diagram

The above figure shows the collaboration diagram of the proposed system, where we represented
the collaboration between the actor and function modules with sequence number.
DEPLOYMENT DIAGRAM

[Figure: deployment diagram. Nodes: Client, Middleware, Server. Artifacts: Dataset, Pre-process,
Split X,Y Train, DNN, Trained model, Output.]

Figure: Deployment Diagram

In UML, the deployment diagram models the physical deployment of artifacts on nodes. The
nodes appear as boxes, and the artifacts allocated to each node appear as rectangles within the
boxes. Nodes may have subnodes, which appear as nested boxes. A single node in a deployment
diagram may conceptually represent multiple physical nodes, such as a cluster of database
servers.
