Dataset Information
About the Dataset: Haberman's Survival Dataset contains cases from a study conducted between 1958 and 1970 at the University of Chicago's Billings Hospital on the survival of patients who had undergone surgery for breast cancer.
Attribute Information:
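1. Age of the patient at the time of operation (age)
2. Year of operation (operation_year)
3. Number of positive axillary lymph nodes detected (axil_nodes)
4. Survival status (class attribute, survival_status): 1 = the patient survived 5 years or longer; 2 = the patient died within 5 years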
Objective: To predict whether a patient will survive 5 years or longer after surgery, based on the patient's age, the year of operation and the number of positive lymph nodes.
import pandas as pd

haberman = pd.read_csv("haberman.csv")
In [27]: # haberman.describe()
print(haberman.head(15))
In [28]: print(haberman.shape)
(306, 4)
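The cells above read a haberman.csv that already carries the four column names seen in the later output. If the file at hand has no header row (an assumption here; the raw file distributed by the UCI repository is headerless), the names can be attached at load time. A minimal sketch, in which `load_haberman` is a hypothetical helper and the three rows are illustrative stand-ins for the file:

```python
import io

import pandas as pd

# Column names as they appear in the notebook's output.
COLUMNS = ["age", "operation_year", "axil_nodes", "survival_status"]

# Illustrative rows in a headerless, comma-separated layout (assumed format).
SAMPLE = "30,64,1,1\n34,59,0,2\n52,66,4,1\n"


def load_haberman(source):
    """Read a headerless Haberman file and attach the column names."""
    return pd.read_csv(source, header=None, names=COLUMNS)


sample_df = load_haberman(io.StringIO(SAMPLE))
print(sample_df.shape)
print(sample_df.dtypes)
```

With the real file, `load_haberman("haberman.csv")` would take the place of the `io.StringIO` call.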
In [29]: haberman.describe()
Out[29]:
age operation_year axil_nodes survival_status
In [30]: haberman['survival_status'].value_counts()
Out[30]: 1 225
2 81
Name: survival_status, dtype: int64
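The two counts above already quantify the class imbalance. A short sketch of the arithmetic, using only the numbers printed by value_counts(); `baseline_accuracy` is a name introduced here for illustration:

```python
import pandas as pd

# Class counts reported by value_counts() above.
counts = pd.Series({1: 225, 2: 81}, name="survival_status")

# Share of each class in the 306 rows.
proportions = counts / counts.sum()
print(proportions.round(3))

# Accuracy of a trivial model that always predicts the majority class.
baseline_accuracy = proportions.max()
print(f"majority-class baseline accuracy: {baseline_accuracy:.3f}")
```

Any classifier built on this data should be judged against that baseline of roughly 0.735 rather than against 0.5.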
In [31]: haberman.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 306 entries, 0 to 305
Data columns (total 4 columns):
age 306 non-null int64
operation_year 306 non-null int64
axil_nodes 306 non-null int64
survival_status 306 non-null int64
dtypes: int64(4)
memory usage: 9.6 KB
OBSERVATION
1. The dataset is imbalanced (225 vs. 81 cases) but complete: no values are missing.
2. The class label, survival_status, is stored as an integer and needs to be converted to a categorical datatype.
3. The values of survival_status are relabelled as {1: "yes", 2: "no"}, where "yes" means the patient survived and "no" means the patient did not survive.
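The relabelling described above can be done with a dictionary mapping followed by a cast to a categorical dtype. A minimal sketch on a few illustrative rows (in the notebook the same two steps would apply to the haberman DataFrame; the cell that actually did this is not shown, so this is one plausible way rather than the author's exact code):

```python
import pandas as pd

# Illustrative rows only; in the notebook this would be the haberman DataFrame.
df = pd.DataFrame({
    "age": [30, 34, 52, 61],
    "survival_status": [1, 2, 1, 2],
})

# Map the integer codes to labels, then store them as a categorical column.
df["survival_status"] = (
    df["survival_status"].map({1: "yes", 2: "no"}).astype("category")
)

print(df["survival_status"].unique())
print(df["survival_status"].dtype)
```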
1. NUMBER OF FEATURES
In [14]: print(haberman.columns)
In [16]: print(haberman.columns[:-1])
In [33]: print(haberman["survival_status"].unique())
['yes' 'no']
In [34]: print(haberman.groupby("survival_status").count())
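import matplotlib.pyplot as plt
import seaborn as sns

# Scatter plot of age vs. positive axillary nodes, coloured by survival status.
sns.set_style("whitegrid")
# `height` replaces the `size` argument in seaborn 0.9 and later.
sns.FacetGrid(haberman, hue="survival_status", height=6) \
   .map(plt.scatter, "age", "axil_nodes") \
   .add_legend()
plt.show()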
OBSERVATION
1. Patients younger than 40 with fewer than 30 positive axillary nodes have a higher chance of survival.
2. Patients older than 50 with more than 10 positive axillary nodes are more likely to die.
OBSERVATION
1. According to the figure above, operation years 60, 61 and 68 have the highest survival rates.
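A statement about survival rate per year can also be checked numerically: a count or density plot reflects how many operations took place in a year, whereas a rate divides the survivors by all patients operated on that year. A minimal sketch, assuming survival_status has already been mapped to "yes"/"no"; the rows are illustrative and not taken from the dataset:

```python
import pandas as pd

# Illustrative rows only; not the real data.
df = pd.DataFrame({
    "operation_year": [60, 60, 60, 61, 61, 68, 68, 68],
    "survival_status": ["yes", "no", "yes", "yes", "yes", "yes", "no", "no"],
})

# Fraction of patients per operation year whose status is "yes".
survival_rate = (
    df["survival_status"].eq("yes")
      .groupby(df["operation_year"])
      .mean()
      .sort_values(ascending=False)
)
print(survival_rate)
```

Pairing the rate with the per-year patient count (for example `df.groupby("operation_year").size()`) helps avoid reading too much into years with very few operations.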
1. PAIR PLOT
In [39]: plt.close()
sns.set_style("whitegrid")
# `height` replaces the `size` argument in seaborn 0.9 and later.
sns.pairplot(haberman, hue="survival_status", height=4)
plt.show()
1. HISTOGRAM
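In [3]: # sns.distplot is deprecated since seaborn 0.11 (sns.histplot with kde=True is
# the closest replacement); `height` replaces the older `size` argument.
sns.FacetGrid(haberman, hue="survival_status", height=5) \
   .map(sns.distplot, "age") \
   .add_legend()
plt.show()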
C:\Users\Anand\Anaconda3\lib\site-packages\matplotlib\axes\_axes.py:6462: UserWarning: The 'normed' kwarg is deprecated, and has been replaced by the 'density' kwarg.
warnings.warn("The 'normed' kwarg is deprecated, and has been "
In [44]: sns.FacetGrid(haberman, hue="survival_status", height=5) \
   .map(sns.distplot, "operation_year") \
   .add_legend()
plt.show()
OBSERVATION
1. Operations performed in years 63-66 had the highest survival rate.
2. Operations performed in year 60 had the highest rate of non-survival.
In [45]: sns.FacetGrid(haberman, hue="survival_status", height=5) \
   .map(sns.distplot, "axil_nodes") \
   .add_legend()
plt.show()
In [8]: import numpy as np

# PDF and CDF of each feature, computed over all patients in haberman.
plt.figure(figsize=(20, 6))
for i, col in enumerate(["age", "operation_year", "axil_nodes"], start=1):
    plt.subplot(1, 3, i)  # 1 row, 3 columns, i-th panel
    counts, bin_edges = np.histogram(haberman[col], bins=10, density=True)
    pdf = counts / sum(counts)  # fraction of patients in each bin
    cdf = np.cumsum(pdf)        # fraction of patients up to each bin's right edge
    plt.plot(bin_edges[1:], pdf, linewidth=3.0)
    plt.plot(bin_edges[1:], cdf, linewidth=3.0)
    plt.ylabel("probability")
    plt.xlabel(col)
    plt.title("PDF-CDF of " + col + " (all patients)")
    plt.legend(["PDF-" + col, "CDF-" + col], loc=5, prop={"size": 11})
plt.show()
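The PDF and CDF values plotted above come from a histogram: each bin's count is divided by the total so the bins sum to 1, and the running sum of those fractions gives the CDF. A small self-contained sketch of that computation without the plotting; `pdf_cdf` is a hypothetical helper and the ages are illustrative values, not the dataset:

```python
import numpy as np


def pdf_cdf(values, bins=10):
    """Return bin edges, per-bin probability mass, and cumulative probability."""
    counts, bin_edges = np.histogram(values, bins=bins)
    pdf = counts / counts.sum()  # fraction of observations in each bin
    cdf = np.cumsum(pdf)         # fraction of observations up to each right edge
    return bin_edges, pdf, cdf


# Illustrative ages only; not the real data.
ages = np.array([30, 34, 38, 41, 45, 47, 50, 52, 52, 55, 58, 61, 65, 70, 77])
bin_edges, pdf, cdf = pdf_cdf(ages, bins=5)
print(pdf)
print(cdf)
```

To compare classes, the same helper could be called once per class, for example on `haberman.loc[haberman["survival_status"] == "yes", "age"]`, assuming the labels have been mapped to "yes"/"no".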
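In [14]: # Note: every statistic below is computed on the full "age" column, so the
# values labelled "class 1" and "class 2" are identical. A per-class comparison
# needs the rows filtered by survival_status first.
print("mean:")
print('mean of class 1 is:', np.mean(haberman["age"]))
# Mean with an outlier.
print('mean with outlier is:', np.mean(np.append(haberman["age"], 2240)))
print('mean of class 2 is: ', np.mean(haberman["age"]))
print("\nStandard-dev:")
print('STD of class 1 is:', np.std(haberman["age"]))
print('STD of class 2 is:', np.std(haberman["age"]))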
mean:
mean of class 1 is: 52.45751633986928
mean with outlier is: 59.583061889250814
mean of class 2 is: 52.45751633986928
Standard-dev:
STD of class 1 is: 10.78578520363183
STD of class 2 is: 10.78578520363183
OBSERVATION
1. The cell computes both "class" means on the full age column, so they print the same value (52.46) and do not by themselves compare survivors with non-survivors. Appending a single outlier of 2240 raises the mean to 59.58.
2. Thus, the mean can be easily corrupted by an outlier.
3. The printed standard deviations are identical (10.79) for the same reason.
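To compare the two classes, the rows need to be filtered by survival_status before the statistics are computed. A minimal sketch on illustrative rows (not the dataset), assuming the labels have been mapped to "yes"/"no":

```python
import numpy as np
import pandas as pd

# Illustrative rows only; not the real data.
df = pd.DataFrame({
    "age": [30, 40, 50, 60, 34, 46, 54, 66],
    "survival_status": ["yes"] * 4 + ["no"] * 4,
})

# Filter by class first, then compute the statistics on each subset.
survived = df.loc[df["survival_status"] == "yes", "age"]
died = df.loc[df["survival_status"] == "no", "age"]

print("mean (yes):", survived.mean(), " mean (no):", died.mean())
# ddof=0 gives the population standard deviation, the default of np.std.
print("std  (yes):", survived.std(ddof=0), " std  (no):", died.std(ddof=0))

# A single extreme value shifts the mean substantially.
with_outlier = np.append(survived, 2240)
print("mean (yes) with outlier:", with_outlier.mean())
```

On these eight rows the mean for "yes" jumps from 45 to 484 once the outlier is appended, which illustrates the sensitivity noted in the observation.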
In [15]: # Note: as in the previous cell, every statistic uses the full "age" column,
# so the paired lines below print the same value twice.
print("\nmedians:")
print('median of class 1 is:', np.median(haberman["age"]))
# Median with an outlier.
print('median with outlier is:', np.median(np.append(haberman["age"], 2240)))
print('median of class 2 is:', np.median(haberman["age"]))
print("\nQuantiles:")
# np.arange(0, 100, 25) yields the 0th, 25th, 50th and 75th percentiles.
print(np.percentile(haberman["age"], np.arange(0, 100, 25)))
print(np.percentile(haberman["age"], np.arange(0, 100, 25)))
print("\n90th Percentiles:")
print(np.percentile(haberman["age"], 90))
print(np.percentile(haberman["age"], 90))
print("\n85th Percentiles:")
print(np.percentile(haberman["age"], 85))
print(np.percentile(haberman["age"], 85))
medians:
median of class 1 is: 52.0
median with outlier is: 52.0
median of class 2 is: 52.0
Quantiles:
[30. 44. 52. 60.75]
[30. 44. 52. 60.75]
90th Percentiles:
67.0
67.0
85th Percentiles:
65.0
65.0
OBSERVATION
1. The median of age is 52.0 both with and without the outlier of 2240, so the outlier has little or no effect on the median. Unlike the mean, the median cannot be easily corrupted by an outlier.
2. Split by class, the ages at the 0%, 25%, 50% and 75% quantiles are 30, 43, 52, 60 for class 1 and 34, 46, 53, 61 for class 2. The cell above prints 30, 44, 52, 60.75 twice because it uses the full age column for both lines.
3. The 90th percentile is 67.0 for both classes.
4. The 85th percentile is 64 for class 1 and 65 for class 2 (65.0 on the full column, as printed above).
5. The median absolute deviation differs between the two classes.
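The per-class medians, quantiles and median absolute deviation mentioned above can be computed by grouping on survival_status. A minimal sketch on illustrative rows (not the dataset), assuming "yes"/"no" labels; `mad` is a hypothetical helper that returns the unscaled median absolute deviation (some libraries multiply it by a constant of about 1.4826 to make it comparable to a standard deviation):

```python
import numpy as np
import pandas as pd


def mad(values):
    """Median absolute deviation: median of |x - median(x)| (unscaled)."""
    values = np.asarray(values, dtype=float)
    return np.median(np.abs(values - np.median(values)))


# Illustrative rows only; not the real data.
df = pd.DataFrame({
    "age": [30, 40, 50, 60, 70, 34, 46, 54, 66, 80],
    "survival_status": ["yes"] * 5 + ["no"] * 5,
})

for label, ages in df.groupby("survival_status")["age"]:
    print(label,
          "median:", np.median(ages),
          "quantiles:", np.percentile(ages, [0, 25, 50, 75]),
          "MAD:", mad(ages))

# The median moves only slightly when one extreme value is appended.
survived = df.loc[df["survival_status"] == "yes", "age"].to_numpy()
print(np.median(survived), np.median(np.append(survived, 2240)))
```

On these rows the median for "yes" goes from 50 to 55 with the outlier appended, whereas the mean of the same values would go from 50 to 415.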
In [16]: #BOX-PLOT
OBSERVATION
1. From axil_nodes and survival_status, we can conclude that the higher the number of positive axillary nodes, the higher the chance of death.
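The comparison a box plot makes visually can also be read off as numbers: the quartiles of axil_nodes for each class. A minimal sketch on illustrative rows (not the dataset), assuming "yes"/"no" labels; note that the whiskers of a typical box plot follow a 1.5 x IQR rule, so the minimum and maximum shown here need not coincide with the whisker ends:

```python
import pandas as pd

# Illustrative rows only; not the real data.
df = pd.DataFrame({
    "axil_nodes": [0, 0, 1, 2, 3, 0, 4, 9, 13, 23],
    "survival_status": ["yes"] * 5 + ["no"] * 5,
})

# Minimum, quartiles and maximum of axil_nodes for each class.
summary = (
    df.groupby("survival_status")["axil_nodes"]
      .quantile([0, 0.25, 0.5, 0.75, 1.0])
      .unstack()
)
print(summary)
```

A higher median and upper quartile in the "no" row than in the "yes" row would be the numeric counterpart of the observation above.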
1. VIOLIN PLOT
1. CONTOUR PLOT
sns.jointplot(x="age",y="axil_nodes",data=haberman, kind="kde")
plt.show()
sns.jointplot(x="operation_year",y="axil_nodes",data=haberman, kind="kde")
plt.show()
OVERALL OBSERVATION
1. The dataset is imbalanced but complete, as no values are missing.
2. The class label, survival_status, is stored as an integer and needs to be converted to a categorical datatype.
3. The values of survival_status are relabelled as {1: "yes", 2: "no"}, where "yes" means the patient survived and "no" means the patient did not survive.
4. This is a binary classification problem: we need to predict whether a patient will survive 5 years or longer, based on the patient's age, the year of operation and the number of positive lymph nodes.
5. Half of the patients are aged 52 or younger (the median age is 52).
6. Operations performed in years 63-66 had the highest survival rate.
7. Operations performed in year 60 had the highest rate of non-survival. Patients in the age range 40-60 survived the most.
8. Patients with 0 positive axillary nodes have the highest survival rate.
9. From axil_nodes and survival_status, we can conclude that the higher the number of positive axillary nodes, the higher the chance of death.
10. From the pair plots above, the two classes are not linearly separable.
11. Patients younger than 40 with fewer than 30 positive axillary nodes have a higher chance of survival.
12. Patients older than 50 with more than 10 positive axillary nodes are more likely to die.
13. Patients with more than 50 positive axillary nodes have a higher rate of non-survival.
14. Operation years 60, 61 and 68 have higher survival rates.