July 5, 2017
1.0.2 Data
In the dataset provided, each row represents a resume. The race column has two values, b and
w, indicating a black-sounding or a white-sounding name. The column call has two values, 1 and 0,
indicating whether or not the resume received a call from employers.
Note that the b and w values in race are assigned randomly to the resumes when presented
to the employer.
Out[81]: 157.0
In [82]: data.head()
occupspecific ... compreq orgreq manuf transcom bankreal tr
0 17 ... 1.0 0.0 1.0 0.0 0.0
1 316 ... 1.0 0.0 1.0 0.0 0.0
2 19 ... 1.0 0.0 1.0 0.0 0.0
3 313 ... 1.0 0.0 1.0 0.0 0.0
4 313 ... 1.0 1.0 0.0 0.0 0.0
[5 rows x 65 columns]
1.0.3 1. What test is appropriate for this problem? Does CLT apply?
In [83]: # data size
len(data.id)
Out[83]: 4870
Out[84]: 2435
Out[85]: 2435
Since the sample size is large for both groups (2435 resumes each for black-sounding and
white-sounding names), the CLT applies here.
Since we are comparing two independent samples, a two-sample t-test should be considered.
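The CLT claim can be sanity-checked with a common rule of thumb for 0/1 outcomes: each group should have at least about 10 expected callbacks and 10 expected non-callbacks. The sketch below assumes the group size of 2435 and the rounded callback rates (0.064 and 0.097) reported in this notebook; the threshold of 10 and the helper name `expected_counts` are assumptions made for illustration.

```python
# Rule-of-thumb check that the CLT is reasonable for a binary (0/1) outcome.
# Assumed inputs: group size and rounded callback rates reported in the notebook.
n_per_group = 2435
callback_rates = {"b": 0.064, "w": 0.097}

MIN_EXPECTED = 10  # conventional threshold; an assumption, not a hard rule


def expected_counts(n, p):
    """Return the expected number of successes and failures for n trials."""
    return n * p, n * (1 - p)


clt_ok = {}
for group, rate in callback_rates.items():
    successes, failures = expected_counts(n_per_group, rate)
    clt_ok[group] = successes >= MIN_EXPECTED and failures >= MIN_EXPECTED
    print(f"{group}: expected callbacks={successes:.1f}, "
          f"expected non-callbacks={failures:.1f}, CLT ok={clt_ok[group]}")
```

Both groups clear the threshold by a wide margin, which supports treating the sample means as approximately normal.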
w_data_var = w_data.call.var()
avg_var = (b_data_var + w_data_var)/2 # since n1 = n2, weighted avg var th
The mean for the callback of black race is 0.064, variance is 0.060.
The mean for the callback of white race is 0.097, variance is 0.087.
The difference between the means of calls is 0.032, variance is 0.074.
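As a rough illustration of how these summary statistics feed a two-sample test, the sketch below computes a test statistic, a two-sided p-value, and a 95% confidence interval. It assumes 2435 resumes per group and uses the rounded values printed above, so the numbers are approximate. With samples this large the t distribution is very close to the normal, so the p-value uses a normal approximation rather than an exact t distribution.

```python
import math

# Assumed inputs: rounded summary statistics reported in the notebook.
n = 2435                      # resumes per group
var_b, var_w = 0.060, 0.087   # sample variances of call for b and w
diff = 0.032                  # reported difference in mean callback rate (w - b)

# Standard error of the difference between two independent sample means.
# With n1 = n2 this equals sqrt(2 * avg_var / n), where avg_var is the
# average of the two variances.
se = math.sqrt(var_b / n + var_w / n)

# Test statistic under H0: no difference in callback rates.
z = diff / se

# Two-sided p-value from the standard normal: 2 * (1 - Phi(|z|)).
p_value = math.erfc(abs(z) / math.sqrt(2))

# Approximate 95% confidence interval for the difference.
margin = 1.96 * se
ci = (diff - margin, diff + margin)

print(f"standard error = {se:.4f}")
print(f"test statistic = {z:.2f}")
print(f"two-sided p-value = {p_value:.2e}")
print(f"95% CI for the difference = ({ci[0]:.3f}, {ci[1]:.3f})")
```

Under these assumptions the statistic comes out a little above 4 and the interval excludes zero, so the null hypothesis of equal callback rates would be rejected at conventional significance levels.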
1.0.7 5. Does your analysis mean that race/name is the most important factor in callback success?
Why or why not? If not, how would you amend your analysis?
Not necessarily. The analysis above indicates that how a name sounds (its perceived race) has a
statistically significant effect on the callback rate. However, it does not tell us whether other
variables are also significant, or whether race is the most important factor. To understand the
relationship between callbacks and the other variables, we could fit a regression model (possibly
with LASSO regularization) or run PCA as further analysis.