EDA of Haberman Survival Assignment

11/30/2019 EDA of Haberman Survival assignment
Dataset Information
About the dataset: The Haberman's Survival Dataset contains cases from a study conducted between 1958 and 1970 at the University of Chicago's Billings Hospital on the survival of patients who had undergone surgery for breast cancer.
Dataset source :- https://www.kaggle.com/gilsousa/habermans-survival-data-set/data

Attribute Information:
1. Age of patient at time of operation (numerical)
2. Patient's year of operation (year - 1900, numerical)
3. Number of positive axillary nodes detected (numerical)
4. Survival status (class attribute): 1 = the patient survived 5 years or longer; 2 = the patient died within 5 years
Objective: To predict whether a patient will survive 5 years or longer after surgery, based on the patient's age, year of operation and number of positive axillary lymph nodes.
1. LOADING THE DATASET
In [2]: import pandas as pd

import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
haberman=pd.read_csv("haberman.csv")
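The head() output in this notebook shows four named columns. If a copy of haberman.csv has no header row (an assumption; this notebook does not say either way), the names can be supplied at load time. A minimal sketch, using an in-memory stand-in for the file and the column names used in this notebook:

```python
import io
import pandas as pd

# Stand-in for a header-less haberman.csv (hypothetical file layout);
# the four rows are copied from the head() output of this notebook.
raw_csv = io.StringIO("30,64,1,1\n30,62,3,1\n30,65,0,1\n34,59,0,2\n")

columns = ["age", "operation_year", "axil_nodes", "survival_status"]

# header=None tells pandas that the first line is data, not column names.
demo = pd.read_csv(raw_csv, header=None, names=columns)
print(demo)
```

If the file already has a header row, plain pd.read_csv("haberman.csv") as in the cell above is enough.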
In [27]: # haberman.describe()
print(haberman.head(15))
age operation_year axil_nodes survival_status

0 30 64 1 1
1 30 62 3 1
2 30 65 0 1
3 31 59 2 1
4 31 65 4 1
5 33 58 10 1
6 33 60 0 1
7 34 59 0 2
8 34 66 9 2
9 34 58 30 1
10 34 60 1 1
11 34 61 10 1
12 34 67 7 1
13 34 60 0 1
14 35 64 13 1

In [28]: print(haberman.shape)
(306, 4)
In [29]: haberman.describe()
Out[29]:
age operation_year axil_nodes survival_status
count 306.000000 306.000000 306.000000 306.000000
mean 52.457516 62.852941 4.026144 1.264706
std 10.803452 3.249405 7.189654 0.441899
min 30.000000 58.000000 0.000000 1.000000
25% 44.000000 60.000000 0.000000 1.000000
50% 52.000000 63.000000 1.000000 1.000000
75% 60.750000 65.750000 4.000000 2.000000
max 83.000000 69.000000 52.000000 2.000000
In [30]: haberman['survival_status'].value_counts()
Out[30]: 1 225
2 81
Name: survival_status, dtype: int64
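The counts above can be turned into class proportions to quantify the imbalance. The sketch below rebuilds a label series from the two counts (225 and 81) rather than reading the CSV, so it runs on its own; the variable names are illustrative.

```python
import pandas as pd

# Label series rebuilt from the value_counts() output above.
status = pd.Series([1] * 225 + [2] * 81, name="survival_status")

# normalize=True returns fractions of the total instead of raw counts.
class_share = status.value_counts(normalize=True)
print(class_share.round(3))

# Ratio of the majority class to the minority class.
counts = status.value_counts()
imbalance_ratio = counts.max() / counts.min()
print(round(imbalance_ratio, 2))
```

Roughly 73.5% of the patients belong to class 1, so any model trained on this data should be compared against a classifier that always predicts class 1.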
In [31]: haberman.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 306 entries, 0 to 305
Data columns (total 4 columns):
age 306 non-null int64
operation_year 306 non-null int64
axil_nodes 306 non-null int64
survival_status 306 non-null int64
dtypes: int64(4)
memory usage: 9.6 KB

In [32]: haberman['survival_status'] = haberman['survival_status'].map({1:"yes", 2:"no"})
print(haberman.head(20))
age operation_year axil_nodes survival_status

0 30 64 1 yes
1 30 62 3 yes
2 30 65 0 yes
3 31 59 2 yes
4 31 65 4 yes
5 33 58 10 yes
6 33 60 0 yes
7 34 59 0 no
8 34 66 9 no
9 34 58 30 yes
10 34 60 1 yes
11 34 61 10 yes
12 34 67 7 yes
13 34 60 0 yes
14 35 64 13 yes
15 35 63 0 yes
16 36 60 1 yes
17 36 69 0 yes
18 37 60 0 yes
19 37 63 0 yes
OBSERVATION
1. The dataset is unbalanced (225 survived vs. 81 died) but complete: no values are missing.
2. The class label survival_status was stored as an integer and needed to be converted to a categorical value.
3. The class label "survival_status" is now mapped as {1: "yes", 2: "no"}, where "yes" means the patient survived 5 years or longer and "no" means the patient died within 5 years.
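The mapping in the cell above produces plain strings (dtype object). If a true pandas categorical column is wanted, one option is to chain astype("category") after the mapping. This is a sketch on a hypothetical five-row frame, not the notebook's own code:

```python
import pandas as pd

# Hypothetical miniature frame standing in for the haberman DataFrame.
demo = pd.DataFrame({"survival_status": [1, 1, 2, 1, 2]})

labels = {1: "yes", 2: "no"}

# map() returns NaN for any value missing from the dictionary,
# so checking for NaN afterwards catches unexpected codes.
demo["survival_status"] = demo["survival_status"].map(labels).astype("category")

print(demo["survival_status"].dtype)
print(demo["survival_status"].isna().sum())
```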
1. NUMBER OF FEATURES
In [14]: print(haberman.columns)
Index(['age', 'operation year', 'axil nodes', 'survival status'], dtype='object')
In [16]: print(haberman.columns[:-1])
Index(['age', 'operation year', 'axil nodes'], dtype='object')
In [33]: print(haberman["survival_status"].unique())
['yes' 'no']
In [34]: print(haberman.groupby("survival_status").count())
age operation_year axil_nodes

survival_status
no 81 81 81
yes 225 225 225

1. 2-D SCATTER PLOT
In [35]: haberman.plot(kind = 'scatter', x = 'operation_year', y = 'age');
plt.show()

In [37]: ## AGE <> AXIL NODES

# haberman.plot(kind='scatter', x='age', y='axil_nodes') ;
# plt.show()
sns.set_style("whitegrid");
# height= replaces size= (renamed in seaborn 0.9)
sns.FacetGrid(haberman, hue="survival_status", height=6) \
.map(plt.scatter, "age", "axil_nodes") \
.add_legend();
plt.show();
OBSERVATION
1. Patients younger than 40 with fewer than 30 positive axillary nodes appear more likely to survive.
2. Patients older than 50 with more than 10 positive axillary nodes appear more likely to die within 5 years.

In [36]: ## AGE <> OPERATION YEAR

# height= replaces size= (renamed in seaborn 0.9)
sns.FacetGrid(haberman, hue="survival_status", height=5) \
.map(plt.scatter, "operation_year", "age") \
.add_legend();
plt.show();
OBSERVATION: In the figure above, operation years 60, 61 and 68 (1960, 1961 and 1968) appear to have a higher survival rate than the other years.
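Survival rates are hard to read off a scatter plot, so a per-year table is a useful cross-check. The sketch below uses only the first 20 rows printed by head(20) earlier in this notebook; that sample is far too small to support conclusions and serves only to show the groupby pattern. With the full DataFrame, sample would be replaced by haberman.

```python
import pandas as pd

# First 20 rows of the dataset, copied from the head(20) output.
sample = pd.DataFrame({
    "operation_year": [64, 62, 65, 59, 65, 58, 60, 59, 66, 58,
                       60, 61, 67, 60, 64, 63, 60, 69, 60, 63],
    "survival_status": ["yes", "yes", "yes", "yes", "yes", "yes", "yes",
                        "no", "no", "yes", "yes", "yes", "yes", "yes",
                        "yes", "yes", "yes", "yes", "yes", "yes"],
})

# The mean of a boolean column is the fraction of True values,
# i.e. the share of patients who survived in each year.
by_year = (
    sample.assign(survived=sample["survival_status"].eq("yes"))
    .groupby("operation_year")["survived"]
    .agg(survival_rate="mean", n_patients="count")
)
print(by_year)
```

Reporting n_patients next to the rate makes it clear when a high or low rate rests on only a handful of patients.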
1. PAIR PLOT

In [39]: plt.close();
# height= replaces size= (renamed in seaborn 0.9)
sns.pairplot(haberman, hue="survival_status", height=4);
plt.show()
1. HISTOGRAM

In [3]: # height= replaces size= (renamed in seaborn 0.9)
sns.FacetGrid(haberman,hue="survival_status",height=5)\
.map(sns.distplot,"age")\
.add_legend();
plt.show();
C:\Users\Anand\Anaconda3\lib\site-packages\matplotlib\axes\_axes.py:6462: UserWarning: The 'normed' kwarg is deprecated, and has been replaced by the 'density' kwarg.
warnings.warn("The 'normed' kwarg is deprecated, and has been "

sns.FacetGrid(haberman,hue="survival_status",height=5)\
.map(sns.distplot,"operation_year")\
.add_legend();
plt.show()
OBSERVATION
1. Operation years 63-66 had the highest survival rate.
2. Operation year 60 had the highest non-survival rate.

sns.FacetGrid(haberman,hue="survival_status",height=5)\
.map(sns.distplot,"axil_nodes")\
.add_legend();
plt.show()
1. PDF AND CDF

In [8]: ## PDF and CDF of each feature, computed over all patients
plt.figure(figsize=(20,6))
plt.subplot(131) ## 1 row, 3 columns, 1st panel
counts,bin_edges=np.histogram(haberman["age"],bins=10,density=True)
pdf=counts/sum(counts) ## per-bin probabilities that sum to 1
cdf=np.cumsum(pdf)
plt.plot(bin_edges[1:],pdf,linewidth=3.0)
plt.plot(bin_edges[1:],cdf,linewidth=3.0)
plt.ylabel("Probability")
plt.xlabel('age')
plt.title('PDF-CDF of age (all patients)')
plt.legend(['PDF-age', 'CDF-age'], loc = 5,prop={'size': 16})
plt.subplot(132) ## 2nd panel
counts,bin_edges=np.histogram(haberman["operation_year"],bins=10,density=True)
pdf=counts/sum(counts)
cdf=np.cumsum(pdf)
plt.plot(bin_edges[1:],pdf,linewidth=3.0)
plt.plot(bin_edges[1:],cdf,linewidth=3.0)
plt.ylabel("Probability")
plt.xlabel('operation_year')
plt.title('PDF-CDF of operation_year (all patients)')
plt.legend(['PDF-operation_year', 'CDF-operation_year'], loc = 5,prop={'size': 11})
plt.subplot(133) ## 3rd panel
counts,bin_edges=np.histogram(haberman["axil_nodes"],bins=10,density=True)
pdf=counts/sum(counts)
cdf=np.cumsum(pdf)
plt.plot(bin_edges[1:],pdf,linewidth=3.0)
plt.plot(bin_edges[1:],cdf,linewidth=3.0)
plt.ylabel("Probability")
plt.xlabel('axil_nodes')
plt.title('PDF-CDF of axil_nodes (all patients)')
plt.legend(['PDF-axil_nodes', 'CDF-axil_nodes'], loc = 5,prop={'size': 16})
plt.show()
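The three panels repeat the same histogram-to-CDF steps, which could be factored into a small helper. The function name pdf_cdf below is hypothetical, and the ages are the first 15 values from the head(15) output, used only so the sketch runs without the CSV. Note that the "PDF" here is the probability mass per bin, not a true density.

```python
import numpy as np

def pdf_cdf(values, bins=10):
    """Return bin edges, per-bin probabilities and their cumulative sum."""
    counts, bin_edges = np.histogram(values, bins=bins)
    pdf = counts / counts.sum()   # fraction of observations in each bin
    cdf = np.cumsum(pdf)          # running total; the last entry is 1.0
    return bin_edges, pdf, cdf

# Ages of the first 15 patients, copied from the head(15) output.
ages = np.array([30, 30, 30, 31, 31, 33, 33, 34, 34, 34, 34, 34, 34, 34, 35])

edges, pdf, cdf = pdf_cdf(ages, bins=5)
print(edges)
print(pdf)
print(cdf)
```

Each subplot would then call pdf_cdf once and plot pdf and cdf against edges[1:], as the cell above does.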
1. MEAN VARIANCE and STD Dev

In [14]: # NOTE: every statistic below uses the full "age" column, so the
# "class 1" and "class 2" lines print identical values.
print("mean:")
print('mean of class 1 is:',np.mean(haberman["age"]))
#Mean with an outlier.
print('mean with outlier is:',np.mean(np.append(haberman["age"],2240)));
print('mean of class 2 is: ',np.mean(haberman["age"]))
print("\nStandard-dev:");
print('STD of class 1 is:',np.std(haberman["age"]))
print('STD of class 2 is:',np.std(haberman["age"]))
mean:
mean of class 1 is: 52.45751633986928
mean with outlier is: 59.583061889250814
mean of class 2 is: 52.45751633986928
Standard-dev:
STD of class 1 is: 10.78578520363183
STD of class 2 is: 10.78578520363183
OBSERVATION
1. The mean and standard deviation printed for "class 1" and "class 2" are identical (about 52.46 and 10.79) because both are computed on the age column of the whole dataset rather than on each class separately.
2. Appending a single outlier (2240) raises the mean from 52.46 to 59.58, so the mean is easily distorted by outliers.
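To get statistics that really are per class, the age column has to be split by survival_status first. A minimal sketch using groupby on the first 20 rows from the head(20) output (a sample too small to interpret; with the full data, sample would be the haberman DataFrame):

```python
import pandas as pd

# First 20 rows of the dataset, copied from the head(20) output.
sample = pd.DataFrame({
    "age": [30, 30, 30, 31, 31, 33, 33, 34, 34, 34,
            34, 34, 34, 34, 35, 35, 36, 36, 37, 37],
    "survival_status": ["yes", "yes", "yes", "yes", "yes", "yes", "yes",
                        "no", "no", "yes", "yes", "yes", "yes", "yes",
                        "yes", "yes", "yes", "yes", "yes", "yes"],
})

# One row per class with the number of patients, mean age and std of age.
per_class = sample.groupby("survival_status")["age"].agg(["count", "mean", "std"])
print(per_class)
```

pandas computes the sample standard deviation (ddof=1) while np.std defaults to the population form (ddof=0), which is why describe() reports 10.803 for age and np.std reports 10.786.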

In [15]: from statsmodels import robust
# NOTE: as in the previous cell, every line uses the full "age" column.
print("\nmedians:")
print('median of class 1 is:',np.median(haberman["age"]))
#Median with an outlier
print('median with outlier is:',np.median(np.append(haberman["age"],2240)));
print('median of class 2 is:',np.median(haberman["age"]))
print("\nQuantiles:")
print(np.percentile(haberman["age"],np.arange(0, 100, 25)))
print(np.percentile(haberman["age"],np.arange(0, 100, 25)))
print("\n90th Percentiles:")
print(np.percentile(haberman["age"],90))
print(np.percentile(haberman["age"],90))
print("\n85th Percentiles:")
print(np.percentile(haberman["age"],85))
print(np.percentile(haberman["age"],85))
print ("\nMedian Absolute Deviation")
print(robust.mad(haberman["age"]))
print(robust.mad(haberman["age"]))
medians:
median of class 1 is: 52.0
median with outlier is: 52.0
median of class 2 is: 52.0
Quantiles:
[30. 44. 52. 60.75]
[30. 44. 52. 60.75]
90th Percentiles:
67.0
67.0
85th Percentiles:
65.0
65.0
Median Absolute Deviation

11.860817748044816
11.860817748044816
OBSERVATION
1. The median age is 52.0 with and without the outlier (2240), so the median is far less sensitive to outliers than the mean.
2. The 0%, 25%, 50% and 75% quantiles of age are 30, 44, 52 and 60.75.
3. The 90th percentile of age is 67.0 and the 85th percentile is 65.0.
4. The median absolute deviation of age is about 11.86.
5. Each value is printed twice because the code uses the full age column for both "class" lines; these statistics are not computed separately for survivors and non-survivors.
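The value 11.86 is not the raw median absolute deviation: statsmodels' robust.mad divides the raw value by a constant of about 0.6745 so that the result is comparable to a standard deviation for normally distributed data. A raw deviation of 8 years gives 8 / 0.6745 ≈ 11.86, matching the output above. The helper scaled_mad below is a hypothetical NumPy re-implementation of that formula, shown on the first 20 ages from this notebook:

```python
import numpy as np

def scaled_mad(values, c=0.6744897501960817):
    """Median absolute deviation divided by c (the 0.75 normal quantile)."""
    values = np.asarray(values, dtype=float)
    center = np.median(values)
    return np.median(np.abs(values - center)) / c

# Ages of the first 20 patients, copied from the head(20) output.
ages = np.array([30, 30, 30, 31, 31, 33, 33, 34, 34, 34,
                 34, 34, 34, 34, 35, 35, 36, 36, 37, 37])
with_outlier = np.append(ages, 2240)

# The mean shifts strongly with one extreme value; median and MAD barely move.
print(np.mean(ages), np.mean(with_outlier))
print(np.median(ages), np.median(with_outlier))
print(scaled_mad(ages), scaled_mad(with_outlier))
```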
1. BOX PLOT with WHISKERS

In [16]: #BOX-PLOT
figure, axes = plt.subplots(1, 3, figsize=(15, 5))
for idx, feature in enumerate(list(haberman.columns)[:-1]):
    mystr="Box plot for survival_status and "+feature
    sns.boxplot( x='survival_status', y=feature, data=haberman, ax=axes[idx]).set_title(mystr)
plt.show()
OBSERVATION: The box plot of axil_nodes against survival_status suggests that the higher the number of positive axillary nodes, the higher the chance of death.
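The box and whiskers can also be computed directly. By default the whiskers extend to the most extreme points within 1.5 times the interquartile range (IQR) of the box, and anything beyond is drawn as an outlier. A minimal sketch of that rule on the axil_nodes values of the first 20 rows (copied from the head(20) output; the function name iqr_fences is hypothetical):

```python
import numpy as np

def iqr_fences(values, k=1.5):
    """Return Q1, Q3 and the lower/upper fences at k * IQR from the box."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1, q3, q1 - k * iqr, q3 + k * iqr

# axil_nodes of the first 20 patients, copied from the head(20) output.
nodes = np.array([1, 3, 0, 2, 4, 10, 0, 0, 9, 30,
                  1, 10, 7, 0, 13, 0, 1, 0, 0, 0])

q1, q3, lower, upper = iqr_fences(nodes)
outliers = nodes[(nodes < lower) | (nodes > upper)]
print(q1, q3, lower, upper)
print(outliers)
```

For the full dataset, describe() reports Q1 = 0 and Q3 = 4 for axil_nodes, so the same rule puts the upper fence at 4 + 1.5 * 4 = 10 nodes.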
1. VIOLIN PLOT
In [17]: fig, axes = plt.subplots(1, 3, figsize=(15, 5))
for idx, feature in enumerate(list(haberman.columns)[:-1]):
    # print(idx,feature)
    sns.violinplot( x='survival_status', y=feature, data=haberman, ax=axes[idx])
plt.show()
1. CONTOUR PLOT

In [18]: sns.jointplot(x="age",y="operation_year",data=haberman, kind="kde")

plt.show()
sns.jointplot(x="age",y="axil_nodes",data=haberman, kind="kde")
plt.show()
sns.jointplot(x="operation_year",y="axil_nodes",data=haberman, kind="kde")
plt.show()


OVERALL OBSERVATION
1. The dataset is unbalanced (225 survived, 81 died) but complete: no values are missing.
2. The class label survival_status was stored as an integer and was converted to a categorical value.
3. The class label "survival_status" is mapped as {1: "yes", 2: "no"}, where "yes" means survived and "no" means did not survive.
4. This is a binary classification problem: predict whether a patient will survive 5 years or longer based on the patient's age, year of operation and number of positive axillary lymph nodes.
5. 50% of the patients are aged 52 or younger (the median age is 52).
6. In the histograms, operation years 63-66 show the highest survival rate and operation year 60 shows the highest non-survival rate.
7. Patients in the age range 40-60 account for the most survivors.
8. Patients with axil_nodes = 0 have the highest survival rate.
9. From axil_nodes and survival_status, the higher the number of positive axillary nodes, the higher the chance of death.
10. The classes overlap in all of the pair plots, so they are not linearly separable.
11. Patients younger than 40 with fewer than 30 positive axillary nodes have a higher chance of survival.
12. Patients older than 50 with more than 10 positive axillary nodes are more likely to die.
13. Patients with more than 50 positive axillary nodes have a higher rate of non-survival.
14. In the scatter plot, operation years 60, 61 and 68 show a higher survival rate.
