
11/30/2019 EDA of Haberman Survival assignment

Dataset Information

About the dataset: The Haberman's Survival dataset contains cases from a study conducted between 1958 and 1970 at the University of Chicago's Billings Hospital on the survival of patients who had undergone surgery for breast cancer.

Dataset source :- https://www.kaggle.com/gilsousa/habermans-survival-data-set/data



Attribute Information:

1. Age of patient at time of operation (numerical)

2. Patient's year of operation (year - 1900, numerical)

3. Number of positive axillary nodes detected (numerical)

4. Survival status (class attribute): 1 = the patient survived 5 years or longer, 2 = the patient died within 5 years

Objective: To predict whether a patient will survive 5 years or longer after surgery, based on the patient's age, year of operation and number of positive axillary lymph nodes.

1. LOADING THE DATASET

In [2]: import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

haberman = pd.read_csv("haberman.csv")
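
If a copy of haberman.csv has no header row (an assumption; the column names printed below suggest the file used here does have one), the names can be supplied at load time. A minimal sketch, using an in-memory stand-in for the file and the column names used throughout this notebook:

```python
import io
import pandas as pd

# Stand-in for a header-less haberman.csv (first three rows of the dataset)
raw = io.StringIO("30,64,1,1\n30,62,3,1\n30,65,0,1\n")

# Column names as used in this notebook
columns = ["age", "operation_year", "axil_nodes", "survival_status"]

# header=None stops pandas from treating the first data row as column names
sample = pd.read_csv(raw, header=None, names=columns)
print(sample)
```

With a real file, the `io.StringIO` object would simply be replaced by the file path.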

In [27]: # haberman.describe()
print(haberman.head(15))

    age  operation_year  axil_nodes  survival_status
0    30              64           1                1
1    30              62           3                1
2    30              65           0                1
3    31              59           2                1
4    31              65           4                1
5    33              58          10                1
6    33              60           0                1
7    34              59           0                2
8    34              66           9                2
9    34              58          30                1
10   34              60           1                1
11   34              61          10                1
12   34              67           7                1
13   34              60           0                1
14   35              64          13                1

file:///C:/Users/Anand/Downloads/EDA of Haberman Survival assignment.html 1/17



In [28]: print(haberman.shape)

(306, 4)

In [29]: haberman.describe()

Out[29]:
              age  operation_year  axil_nodes  survival_status
count  306.000000      306.000000  306.000000       306.000000
mean    52.457516       62.852941    4.026144         1.264706
std     10.803452        3.249405    7.189654         0.441899
min     30.000000       58.000000    0.000000         1.000000
25%     44.000000       60.000000    0.000000         1.000000
50%     52.000000       63.000000    1.000000         1.000000
75%     60.750000       65.750000    4.000000         2.000000
max     83.000000       69.000000   52.000000         2.000000

In [30]: haberman['survival_status'].value_counts()

Out[30]: 1 225
2 81
Name: survival_status, dtype: int64
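
The raw counts are easier to read as proportions. A small sketch, rebuilding only the class column from the counts shown above (225 and 81) rather than loading the file:

```python
import pandas as pd

# Rebuild the class column from the counts shown above
status = pd.Series([1] * 225 + [2] * 81, name="survival_status")

# normalize=True returns fractions of the total instead of raw counts
proportions = status.value_counts(normalize=True)
print(proportions.round(3))
```

Roughly three quarters of the patients belong to class 1, which is the imbalance the later observations refer to.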

In [31]: haberman.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 306 entries, 0 to 305
Data columns (total 4 columns):
age 306 non-null int64
operation_year 306 non-null int64
axil_nodes 306 non-null int64
survival_status 306 non-null int64
dtypes: int64(4)
memory usage: 9.6 KB


In [32]: haberman['survival_status'] = haberman['survival_status'].map({1: "yes", 2: "no"})
print(haberman.head(20))

    age  operation_year  axil_nodes survival_status
0    30              64           1             yes
1    30              62           3             yes
2    30              65           0             yes
3    31              59           2             yes
4    31              65           4             yes
5    33              58          10             yes
6    33              60           0             yes
7    34              59           0              no
8    34              66           9              no
9    34              58          30             yes
10   34              60           1             yes
11   34              61          10             yes
12   34              67           7             yes
13   34              60           0             yes
14   35              64          13             yes
15   35              63           0             yes
16   36              60           1             yes
17   36              69           0             yes
18   37              60           0             yes
19   37              63           0             yes

OBSERVATION

1. The dataset is imbalanced (225 patients in class 1 versus 81 in class 2) but complete: no values are missing.
2. The class label survival_status is stored as an integer and needs to be converted to a categorical type.
3. The class label survival_status is therefore relabelled as {1: "yes", 2: "no"}, where "yes" means the patient survived and "no" means the patient did not survive.
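
Note that `.map` on its own produces plain string values; an explicit `astype("category")` is what yields a pandas categorical dtype. A minimal sketch on a small hypothetical frame standing in for `haberman`:

```python
import pandas as pd

# Small hypothetical frame standing in for the haberman DataFrame
df = pd.DataFrame({"survival_status": [1, 1, 2, 1, 2]})

# Map the integer codes to labels, then store them as a pandas categorical
df["survival_status"] = (
    df["survival_status"].map({1: "yes", 2: "no"}).astype("category")
)

print(df["survival_status"].dtype)
print(df["survival_status"].cat.categories.tolist())
```

The plotting calls in the rest of the notebook work with either representation, so this step is optional for the plots themselves.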

2. NUMBER OF FEATURES

In [14]: print(haberman.columns)

Index(['age', 'operation year', 'axil nodes', 'survival status'], dtype='object')

In [16]: print(haberman.columns[:-1])

Index(['age', 'operation year', 'axil nodes'], dtype='object')

In [33]: print(haberman["survival_status"].unique())

['yes' 'no']

In [34]: print(haberman.groupby("survival_status").count())

                 age  operation_year  axil_nodes
survival_status
no                81              81          81
yes              225             225         225


3. 2-D SCATTER PLOT

In [35]: haberman.plot(kind='scatter', x='operation_year', y='age')
plt.show()  # call the function; a bare `plt.show` only returns the function object


In [37]: ## AGE <> AXIL NODES
# haberman.plot(kind='scatter', x='age', y='axil_nodes')
# plt.show()

sns.set_style("whitegrid")
# seaborn 0.9 renamed the `size` argument of FacetGrid to `height`
sns.FacetGrid(haberman, hue="survival_status", height=6) \
    .map(plt.scatter, "age", "axil_nodes") \
    .add_legend()
plt.show()

OBSERVATION

1. Patients with age < 40 and axil_nodes < 30 have a higher chance of survival.
2. Patients with age > 50 and axil_nodes > 10 are more likely to die.
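
Claims like these can be checked numerically instead of by eye. A sketch of one way to do it, using boolean masks; the `survival_rate` helper and the `demo` rows are hypothetical and are not taken from the dataset:

```python
import pandas as pd

def survival_rate(df, mask):
    """Share of patients in the subgroup selected by `mask` that are labelled 'yes'."""
    subgroup = df[mask]
    return (subgroup["survival_status"] == "yes").mean()

# Hypothetical rows for illustration only, not taken from the dataset
demo = pd.DataFrame({
    "age": [30, 35, 38, 55, 60, 62],
    "axil_nodes": [1, 0, 4, 12, 15, 2],
    "survival_status": ["yes", "yes", "no", "no", "no", "yes"],
})

young_low = (demo["age"] < 40) & (demo["axil_nodes"] < 30)
old_high = (demo["age"] > 50) & (demo["axil_nodes"] > 10)

print(survival_rate(demo, young_low))
print(survival_rate(demo, old_high))
```

Applied to the real `haberman` frame, the same two masks would give the survival share behind each observation.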


In [36]: ## AGE <> OPERATION YEAR
sns.set_style("whitegrid")
sns.FacetGrid(haberman, hue="survival_status", height=5) \
    .map(plt.scatter, "operation_year", "age") \
    .add_legend()
plt.show()

OBSERVATION According to the figure above, operation years 60, 61 and 68 have a higher survival rate than the other years.
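
A scatter plot makes per-year rates hard to judge because points overlap. A sketch of how the survival rate per operation year could be tabulated; the `demo` rows are hypothetical, and the column names follow this notebook:

```python
import pandas as pd

# Hypothetical rows for illustration only, not taken from the dataset
demo = pd.DataFrame({
    "operation_year": [60, 60, 60, 61, 61, 68],
    "survival_status": ["yes", "no", "no", "yes", "yes", "yes"],
})

# The mean of a boolean column is the share of True values, i.e. the survival rate
per_year = (
    demo.assign(survived=demo["survival_status"].eq("yes"))
        .groupby("operation_year")["survived"]
        .agg(["mean", "size"])
        .rename(columns={"mean": "survival_rate", "size": "patients"})
)
print(per_year)
```

Keeping the `patients` column next to the rate helps to avoid over-reading years with very few operations.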

4. PAIR PLOT


In [39]: plt.close()
sns.set_style("whitegrid")
sns.pairplot(haberman, hue="survival_status", height=4)
plt.show()

5. HISTOGRAM


In [3]: sns.FacetGrid(haberman, hue="survival_status", height=5) \
    .map(sns.distplot, "age") \
    .add_legend()
plt.show()



In [44]: sns.FacetGrid(haberman, hue="survival_status", height=5) \
    .map(sns.distplot, "operation_year") \
    .add_legend()
plt.show()


OBSERVATION

1. Operation years 63-66 had the highest survival rate.
2. Operation year 60 had the highest non-survival rate.


In [45]: sns.FacetGrid(haberman, hue="survival_status", height=5) \
    .map(sns.distplot, "axil_nodes") \
    .add_legend()
plt.show()


6. PDF AND CDF


In [8]: # PDF and CDF of each feature for patients who survived.
# The titles refer to survival_status = yes, so the rows are filtered first.
survived = haberman[haberman["survival_status"] == "yes"]
plt.figure(figsize=(20, 6))

plt.subplot(131)  # 1 row, 3 columns, 1st panel
counts, bin_edges = np.histogram(survived["age"], bins=10, density=True)
pdf = counts / sum(counts)  # rescale so the bin probabilities sum to 1
cdf = np.cumsum(pdf)
plt.plot(bin_edges[1:], pdf, linewidth=3.0)
plt.plot(bin_edges[1:], cdf, linewidth=3.0)
plt.ylabel("Probability")
plt.xlabel('age')
plt.title('PDF-CDF of age for survival_status = yes')
plt.legend(['PDF-age', 'CDF-age'], loc=5, prop={'size': 16})

plt.subplot(132)  # 2nd panel
counts, bin_edges = np.histogram(survived["operation_year"], bins=10, density=True)
pdf = counts / sum(counts)
cdf = np.cumsum(pdf)
plt.plot(bin_edges[1:], pdf, linewidth=3.0)
plt.plot(bin_edges[1:], cdf, linewidth=3.0)
plt.ylabel("Probability")
plt.xlabel('operation_year')
plt.title('PDF-CDF of operation_year for survival_status = yes')
plt.legend(['PDF-operation_year', 'CDF-operation_year'], loc=5, prop={'size': 11})

plt.subplot(133)  # 3rd panel
counts, bin_edges = np.histogram(survived["axil_nodes"], bins=10, density=True)
pdf = counts / sum(counts)
cdf = np.cumsum(pdf)
plt.plot(bin_edges[1:], pdf, linewidth=3.0)
plt.plot(bin_edges[1:], cdf, linewidth=3.0)
plt.ylabel("Probability")
plt.xlabel('axil_nodes')
plt.title('PDF-CDF of axil_nodes for survival_status = yes')
plt.legend(['PDF-axil_nodes', 'CDF-axil_nodes'], loc=5, prop={'size': 16})
plt.show()
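
The histogram-to-CDF step repeated in each panel can be isolated in a small helper. This is a sketch: `pdf_cdf` is a hypothetical name and the `ages` array is made up for illustration. With equal-width bins, dividing the raw counts by their sum gives the same per-bin probabilities as the `density=True` route used above.

```python
import numpy as np

def pdf_cdf(values, bins=10):
    """Return bin edges, per-bin probability (PDF) and its running sum (CDF)."""
    counts, bin_edges = np.histogram(values, bins=bins)
    pdf = counts / counts.sum()   # fraction of observations falling in each bin
    cdf = np.cumsum(pdf)          # fraction of observations up to each bin's right edge
    return bin_edges, pdf, cdf

# Hypothetical ages for illustration only
ages = np.array([30, 34, 41, 45, 52, 52, 57, 60, 66, 83])
edges, pdf, cdf = pdf_cdf(ages, bins=5)
print(pdf)
print(cdf)
```

Plotting `pdf` and `cdf` against `edges[1:]`, as the cell above does, places each value at the right edge of its bin.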

7. MEAN, VARIANCE AND STD DEV


In [14]: # Note: every line below uses the full age column, so the "class 1" and
# "class 2" figures are identical; filter on survival_status for per-class values.
print("mean:")
print('mean of class 1 is:', np.mean(haberman["age"]))
# Mean with an outlier.
print('mean with outlier is:', np.mean(np.append(haberman["age"], 2240)))
print('mean of class 2 is: ', np.mean(haberman["age"]))

print("\nStandard-dev:")
print('STD of class 1 is:', np.std(haberman["age"]))
print('STD of class 2 is:', np.std(haberman["age"]))

mean:
mean of class 1 is: 52.45751633986928
mean with outlier is: 59.583061889250814
mean of class 2 is: 52.45751633986928

Standard-dev:
STD of class 1 is: 10.78578520363183
STD of class 2 is: 10.78578520363183

OBSERVATION

1. The means printed for the two classes are the same (52.46) because the cell uses the full age column for both lines. Appending a single outlier of 2240 raises the mean to 59.58.
2. Thus, the mean is easily corrupted by an outlier.
3. The standard deviations printed for the two classes are likewise the same (10.79).
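
Two small checks may help here. The first reproduces the outlier arithmetic from the printed figures alone; the second sketches how per-class statistics could be obtained with `groupby`. The `demo` rows are hypothetical and not taken from the dataset.

```python
import pandas as pd

# Part 1: reproduce the outlier arithmetic from the figures printed above
n = 306                          # number of patients, from haberman.shape
mean_age = 52.45751633986928     # mean age printed above
mean_with_outlier = (mean_age * n + 2240) / (n + 1)
print(mean_with_outlier)         # approximately 59.583, as in the output above

# Part 2: per-class statistics on hypothetical rows (not real data)
demo = pd.DataFrame({
    "age": [30, 40, 50, 45, 55, 65],
    "survival_status": ["yes", "yes", "yes", "no", "no", "no"],
})

# pandas' std uses ddof=1 (sample std), whereas np.std defaults to ddof=0
per_class = demo.groupby("survival_status")["age"].agg(["mean", "std"])
print(per_class)
```

The `ddof` difference also explains why `describe()` reports a standard deviation of 10.80 for age while `np.std` prints 10.79.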


In [15]: # Note: every line below uses the full age column, so the paired
# "class 1" / "class 2" values are identical.
from statsmodels import robust

print("\nmedians:")
print('median of class 1 is:', np.median(haberman["age"]))
# Median with an outlier
print('median with outlier is:', np.median(np.append(haberman["age"], 2240)))
print('median of class 2 is:', np.median(haberman["age"]))

print("\nQuantiles:")
print(np.percentile(haberman["age"], np.arange(0, 100, 25)))
print(np.percentile(haberman["age"], np.arange(0, 100, 25)))

print("\n90th Percentiles:")
print(np.percentile(haberman["age"], 90))
print(np.percentile(haberman["age"], 90))

print("\n85th Percentiles:")
print(np.percentile(haberman["age"], 85))
print(np.percentile(haberman["age"], 85))

print("\nMedian Absolute Deviation")
print(robust.mad(haberman["age"]))
print(robust.mad(haberman["age"]))

medians:
median of class 1 is: 52.0
median with outlier is: 52.0
median of class 2 is: 52.0

Quantiles:
[30. 44. 52. 60.75]
[30. 44. 52. 60.75]

90th Percentiles:
67.0
67.0

85th Percentiles:
65.0
65.0

Median Absolute Deviation
11.860817748044816
11.860817748044816

OBSERVATION

1. The median age with and without the outlier is the same (52.0), so the outlier has little or no effect on the median. Unlike the mean, the median is not easily corrupted by an outlier.
2. Ages at the 0%, 25%, 50% and 75% quantiles are stated as 30, 43, 52, 60 for class 1 and 34, 46, 53, 61 for class 2. The output above prints [30, 44, 52, 60.75] twice because both lines use the full age column.
3. The 90th percentile is 67.0 in both printed lines.
4. The 85th percentile is stated as 64 for class 1 and 65 for class 2. The output above prints 65.0 for both lines.
5. The median absolute deviation is stated to differ between the classes. The output above prints 11.86 for both lines.
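
The printed value of 11.86 is consistent with a scaled median absolute deviation: a raw MAD of 8 years divided by about 0.6745, the 75th percentile of the standard normal distribution. The sketch below reimplements that definition in NumPy; `scaled_mad` is a hypothetical helper, and the assumption that it mirrors the defaults of `robust.mad` rests only on that numerical agreement.

```python
import numpy as np

def scaled_mad(values, c=0.6744897501960817):
    """Median absolute deviation around the median, divided by c.

    c is the 75th percentile of the standard normal distribution, which makes
    the result comparable to a standard deviation for normally distributed data.
    """
    values = np.asarray(values, dtype=float)
    deviations = np.abs(values - np.median(values))
    return np.median(deviations) / c

# Hypothetical values: replacing 5 by an extreme 100 leaves the result unchanged
print(scaled_mad([1, 2, 3, 4, 5]))
print(scaled_mad([1, 2, 3, 4, 100]))
```

Like the median, this statistic is robust: the extreme value changes one deviation but not the median of the deviations.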

8. BOX PLOT with WHISKERS


In [16]: # BOX-PLOT
figure, axes = plt.subplots(1, 3, figsize=(15, 5))

# One box plot per feature (every column except the class label)
for idx, feature in enumerate(list(haberman.columns)[:-1]):
    mystr = "Box plot for survival_status and " + feature
    sns.boxplot(x='survival_status', y=feature, data=haberman, ax=axes[idx]).set_title(mystr)
plt.show()

OBSERVATION 1. From axil_nodes and survival_status, we can conclude that the higher the number of axillary nodes, the higher the chance of death.
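
The box and whiskers encode the quartiles and, with seaborn's default setting, fences at 1.5 times the interquartile range; points beyond the fences are drawn individually. A sketch of that calculation, where `whisker_limits` is a hypothetical helper and the `nodes` values are made up:

```python
import numpy as np

def whisker_limits(values, k=1.5):
    """Quartiles and the Tukey fences (Q1 - k*IQR, Q3 + k*IQR)."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1, q3, q1 - k * iqr, q3 + k * iqr

# Hypothetical node counts for illustration only
nodes = np.array([0, 0, 0, 1, 1, 2, 4, 4, 9, 30])
q1, q3, low, high = whisker_limits(nodes)

# Values outside the fences would be drawn as individual points
outliers = nodes[(nodes < low) | (nodes > high)]
print(q1, q3, low, high, outliers)
```

For the full dataset, the describe() output gives Q1 = 0 and Q3 = 4 for axil_nodes, so the upper fence is 4 + 1.5 * 4 = 10.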

9. VIOLIN PLOT

In [17]: fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# One violin plot per feature (every column except the class label)
for idx, feature in enumerate(list(haberman.columns)[:-1]):
    # print(idx, feature)
    sns.violinplot(x='survival_status', y=feature, data=haberman, ax=axes[idx])
plt.show()

10. CONTOUR PLOT


In [18]: sns.jointplot(x="age", y="operation_year", data=haberman, kind="kde")
plt.show()

sns.jointplot(x="age",y="axil_nodes",data=haberman, kind="kde")
plt.show()

sns.jointplot(x="operation_year",y="axil_nodes",data=haberman, kind="kde")
plt.show()


OVERALL OBSERVATION

1. The dataset is imbalanced but complete, as no values are missing.
2. The class label survival_status is stored as an integer and needs to be converted to a categorical type.
3. The class label survival_status is relabelled as {1: "yes", 2: "no"}, where "yes" means the patient survived and "no" means the patient did not survive.
4. This is a binary classification problem: we need to predict whether a patient will survive 5 years or longer, based on the patient's age, year of operation and number of positive lymph nodes.
5. 50% of the patients are aged 52 or younger (the median age is 52).
6. Operation years 63-66 had the highest survival rate.
7. Operation year 60 had the highest non-survival rate. Patients aged 40-60 survived the most.
8. Patients with axil_nodes = 0 have the highest survival rate.
9. From axil_nodes and survival_status, we can conclude that the higher the number of axillary nodes, the higher the chance of death.
10. From the pair plots above, the two classes are not linearly separable.
11. Patients with age < 40 and axil_nodes < 30 have a higher chance of survival.
12. Patients with age > 50 and axil_nodes > 10 are more likely to die.
13. Patients with more than 50 axillary nodes have a higher rate of non-survival.
14. Operation years 60, 61 and 68 have a higher survival rate.
