Data_Final Project.xlsx
You work for a bank as a business data analyst in the credit card risk-modeling
department. Your bank conducted a bold experiment three years ago: for a single day
it quietly issued credit cards to everyone who applied, regardless of their credit
risk, until the bank had issued 600 cards without screening applicants.
After three years, 150, or 25%, of those card recipients defaulted: they failed to
pay back at least some of the money they owed. However, the bank collected very
valuable proprietary data that it can now use to optimize its future card-issuing
process.
The bank initially collected six pieces of data about each person:
• Age
In addition, the bank now has a binary outcome: default = 1, and no default = 0.
Your first assignment is to analyze the data and create a binary classification
model to forecast future defaults.
You will combine data from the above six inputs to output a single "score." Use the
Soldier Performance spreadsheet for a simple example of combining multiple inputs.
Therefore, the best you can do is to design your model to maximize the Area Under
the ROC Curve, or AUC.
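As a rough illustration of what AUC measures, the sketch below computes it directly from its pairwise definition: the fraction of (defaulter, non-defaulter) pairs in which the defaulter receives the higher score, with ties counted as one half. The scores and labels are made-up values, and this is not necessarily how the AUC Calculator Spreadsheet arrives at its number.

```python
def auc(scores, labels):
    """Pairwise AUC: chance that a random defaulter outscores a random non-defaulter."""
    pos = [s for s, y in zip(scores, labels) if y == 1]  # defaulters
    neg = [s for s, y in zip(scores, labels) if y == 0]  # non-defaulters
    if not pos or not neg:
        raise ValueError("need at least one defaulter and one non-defaulter")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))

# Hypothetical scores and outcomes (default = 1, no default = 0).
example_scores = [0.9, 0.4, 0.7, 0.6, 0.2, 0.1]
example_labels = [1, 1, 0, 1, 0, 0]
example_auc = auc(example_scores, example_labels)
print(round(example_auc, 2))  # 0.78
```

An AUC of 0.5 corresponds to a score that ranks defaulters no better than chance, and 1.0 to a score that ranks every defaulter above every non-defaulter.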
You are told that if your model is effective ("high enough" AUC, not defined
further) and "robust" (again not defined, but in general this means relatively
little decrease in AUC across multiple sets of new data) then it may be adopted by
the bank as its predictive model for default, to determine which future applicants
will be issued credit cards.
You are first given a "Training Set" of 200 out of the 600 people in the
experiment. The Data_For_Final_Project (below) has both the training set and test
set you will need.
Design your model using the Training Set. Standardized versions of the input data
are also provided for your convenience. You may combine the six inputs by adding
them to, or subtracting them from, each other, taking simple ratios, etc. Exclude
inputs that are not helpful and then experiment with how to combine the most
informative inputs.
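The combining step can be sketched as follows. Everything here is an assumption made for illustration: the two input columns, their values, the use of the population standard deviation (the spreadsheet may standardize differently), and the choice of sign so that a higher score means higher default risk.

```python
from statistics import mean, pstdev

def standardize(values):
    """Convert raw values to z-scores (population standard deviation assumed)."""
    m = mean(values)
    s = pstdev(values)
    return [(v - m) / s for v in values]

# Hypothetical raw inputs for four applicants.
age = [25, 40, 31, 58]
income = [30, 85, 42, 120]

z_age = standardize(age)
z_income = standardize(income)

# One possible model: add the two standardized inputs and flip the sign,
# under the illustrative assumption that lower age and income mean higher risk.
score = [-(a + i) for a, i in zip(z_age, z_income)]
print([round(s, 2) for s in score])
```

Standardizing first keeps one input from dominating the sum simply because it is measured in larger units.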
Note that you will need some of your quiz answers again later, so please write them
down and keep track of them as you go along.
Question: What is your model? Give it as a function of two or more of the six
inputs. For example: (Age + Years at Current Address)/Income [not a great model!].
What is your model's AUC on the Training Set? Use two digits to the right of the
decimal place.
12 x
6 x
.7 r
(Less than .5 is not correct - you need to make the highest value the lowest by
dividing by -1.)
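The feedback about values below .5 can be checked with a small sketch. If a score ranks defaulters lower than non-defaulters, dividing it by -1 reverses the ranking, and the AUC becomes one minus the original value. The data below are hypothetical, and the pairwise AUC function is a stand-in for the AUC Calculator Spreadsheet.

```python
def auc(scores, labels):
    """Pairwise AUC: share of (defaulter, non-defaulter) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical score that points the wrong way: defaulters tend to score low.
raw_scores = [0.1, 0.6, 0.3, 0.4, 0.8, 0.9]
labels = [1, 1, 0, 1, 0, 0]

flipped_scores = [s / -1 for s in raw_scores]  # highest value becomes the lowest

auc_raw = auc(raw_scores, labels)
auc_flipped = auc(flipped_scores, labels)
print(round(auc_raw, 2), round(auc_flipped, 2))  # 0.22 0.78
```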
Next test your model, without changing any parameters, on the Test Set of 200
additional applicants. See the Test Set spreadsheet. It is part of the
Data_For_Final_Project (below) and has both the training and test set.
Data_Final Project.xlsx
Hint: Make and use a second copy of the AUC Calculator Spreadsheet so that you can
compare Test Set and Training Set results easily.
Note that all bank models here include only profits and losses within three years
of when a card is issued, so the impact of out-years (years beyond 3) can be
ignored.
For the 600 individuals that were automatically given cards without being
classified, the total cost of the experiment turned out to be 25%*($5000)*600 or
$750,000. This is $1,250 per event.
Only models with lower cost per event than $1,250 should have any value.
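The baseline arithmetic can be reproduced in a few lines. The variable names are only illustrative; the numbers are the ones stated above.

```python
default_rate = 0.25        # 150 of 600 cardholders defaulted
loss_per_default = 5000    # dollars lost per defaulting cardholder
cards_issued = 600

total_cost = default_rate * loss_per_default * cards_issued
cost_per_event = total_cost / cards_issued

print(total_cost)      # 750000.0
print(cost_per_event)  # 1250.0
```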
Question: What is the threshold score on the Training Set data for your model that
minimizes Cost per Event? You will need this number to answer later questions.
Hint: Using the AUC Calculator Spreadsheet, identify which Column displays the same
cost-per-event (row 17) as the overall minimum cost-per-event shown in Cell H2. The
threshold is shown in row 10 of that Column. What the threshold means is that at
and above this number everything is classified as a "default."
20 x
1000 x
3.5 r
(Thresholds greater than 2.5 may not be utilizing the full range for analysis.
Thresholds less than -2.5 may not be utilizing the full range for analysis.)
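The threshold search that the AUC Calculator Spreadsheet performs column by column can be sketched as below. This is a minimal illustration under stated assumptions: the $5,000 cost of a missed default comes from the figures in this assignment, while the $2,500 cost of rejecting a good applicant is a hypothetical placeholder, and the scores and outcomes are invented.

```python
COST_FALSE_NEGATIVE = 5000  # card issued to someone who defaults (from the text)
COST_FALSE_POSITIVE = 2500  # ASSUMPTION: hypothetical cost of rejecting a good applicant

def cost_per_event(scores, labels, threshold):
    """Average cost when every score at or above the threshold is classified 'default'."""
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    return (fn * COST_FALSE_NEGATIVE + fp * COST_FALSE_POSITIVE) / len(scores)

def best_threshold(scores, labels):
    """Try each observed score as a threshold and keep the cheapest one."""
    candidates = sorted(set(scores))
    return min(candidates, key=lambda t: cost_per_event(scores, labels, t))

# Hypothetical Training Set scores and outcomes (default = 1).
train_scores = [-1.2, -0.4, 0.1, 0.3, 0.9, 1.5, 2.0, -0.8]
train_labels = [0, 0, 0, 1, 0, 1, 1, 0]

train_threshold = best_threshold(train_scores, train_labels)
train_min_cost = cost_per_event(train_scores, train_labels, train_threshold)
print(train_threshold, train_min_cost)  # 0.3 312.5
```

The threshold and the minimum cost per event found this way are the two numbers to record for the later questions.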
Question: Again referring only to the Training Set data, what is the overall
minimum cost-per-event?
Hint: You will need this number to answer later questions. If you used the AUC
Calculator, the overall minimum cost per event will be displayed in Cell H2.
Note: for Coursera to interpret your answer correctly you must give your answer as
an integer - no decimals or dollar sign.
Comparing the New Minimum Cost Per Event on Test Set Data
When you compared AUC for the Training and Test Sets, all that was necessary was to
look up the two different values in Cell AJ. But to get an accurate measure of the
cost savings from using the original model on new data, you cannot automatically
use the new threshold that results in the overall lowest cost-per-event on the Test
Set.
Remember that your model is being tested for its ability to forecast - but the new
optimal threshold will be known only after the outcomes for the entire Test Set are
known.
All you can use is the model you developed on the Training Set data and the
threshold from the Training Set that you should have recorded when answering
Question 4.
Question: At that same threshold score (NOT the threshold score that would minimize
costs for the new Test Set, but the "old" threshold score that minimized costs on
the Training Set) what is the cost per event on the test set?
Hint: Using the AUC Calculator Spreadsheet previously provided, locate the column
on the Training Set data that has the lowest cost-per-event. That same column and
threshold in the Test Set copy of the AUC Calculator will have a new cost-per-
event, displayed in row 17. This is almost always higher than the minimum cost-per-
event on the Training Set, and also higher than what the minimal cost-per-event
would be on the Test Set, if one could know the new optimal threshold in advance.
This number is the actual cost per event when applying the model-and-threshold
developed with the Training Set to the new, Test Set data.
Note: for Coursera to interpret your answer correctly you must give your answer as
an integer - no decimals or dollar sign.
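The difference between the honest Test Set cost and the hindsight-optimal one can be sketched as follows. All values are hypothetical: the Training Set threshold of 0.3, the Test Set scores and outcomes, and the $2,500 false positive cost, which is an assumed placeholder. Only the $5,000 loss per default comes from the assignment.

```python
COST_FALSE_NEGATIVE = 5000  # card issued to someone who defaults (from the text)
COST_FALSE_POSITIVE = 2500  # ASSUMPTION: hypothetical cost of rejecting a good applicant

def cost_per_event(scores, labels, threshold):
    """Average cost when every score at or above the threshold is classified 'default'."""
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    return (fn * COST_FALSE_NEGATIVE + fp * COST_FALSE_POSITIVE) / len(scores)

# Threshold recorded from the Training Set (hypothetical value).
train_threshold = 0.3

# Hypothetical Test Set scores and outcomes (default = 1).
test_scores = [-1.0, -0.2, 0.2, 0.5, 0.7, 1.1, 1.8, 0.0]
test_labels = [0, 0, 1, 0, 1, 1, 1, 0]

# Honest figure: the old threshold applied, unchanged, to the new data.
test_cost = cost_per_event(test_scores, test_labels, train_threshold)

# Hindsight figure: only computable once all Test Set outcomes are known.
hindsight_cost = min(cost_per_event(test_scores, test_labels, t)
                     for t in set(test_scores))

print(test_cost, hindsight_cost)  # 937.5 312.5
```

Only the first number answers the question, because the second depends on outcomes that would not be available at the time of the forecast.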
Assume your Test Set cost-per-event results from Question 6 are sustainable long
term.
Question: How much money does the bank save, per event, using your model and its
data-inputs, instead of issuing credit cards to everyone who asks?
Hint: the cost of issuing credit cards to everyone (no model, no forecast) has been
determined to be 25%*$5000 = $1,250 per event. Dollar value of the model-plus-data
is the difference between $1,250 and your number.
Note: for Coursera to interpret your answer correctly you must give your answer as
an integer - no decimals or dollar sign.
200 r
(<= $150 savings is a weak model)
Question: Given that it apparently cost the bank $750,000 to conduct the three-year
experiment, if the bank processes 1000 credit card applicants per day on average,
how many days will it take to ensure future savings will pay back the bank's
initial investment?
Hint: multiply your answer to Question 7 - the cost savings per applicant - by 1000
to get the savings per day.
700000 x
3 r
(More than a week: poor)
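The payback calculation described in the hint can be sketched as below. The model's cost per event is a hypothetical value chosen only to illustrate the arithmetic; substitute your own Test Set result. Rounding up to whole days is an interpretation of "ensure" and is an assumption.

```python
import math

experiment_cost = 750_000        # dollars, from the text
applicants_per_day = 1000        # from the text
baseline_cost_per_event = 1250   # issuing cards to everyone, from the text

model_cost_per_event = 1000      # HYPOTHETICAL Test Set cost per event

savings_per_event = baseline_cost_per_event - model_cost_per_event
savings_per_day = savings_per_event * applicants_per_day

# Round up so the cumulative savings fully cover the experiment's cost.
payback_days = math.ceil(experiment_cost / savings_per_day)
print(savings_per_event, savings_per_day, payback_days)  # 250 250000 3
```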
Any model that is reducing uncertainty will have a True Positive Rate...
Given that the base rate of default in the population is 25%, any test that is
reducing uncertainty will have a Positive Predictive Value (PPV)...
...Equal to .25 x
...Less than .25 x
...Greater than .25
Given that the base rate of default in the population is 25%, any test that is
reducing uncertainty will have a Negative Predictive Value (NPV)...
...Equal to .75 x
...Less than .75 x
...Greater than .75
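A small worked example may help with the two questions above. The confusion matrix below is hypothetical, built for 200 applicants with a 25% base rate of default. For a classifier that carries some information, the PPV comes out above the base rate and the NPV above one minus the base rate.

```python
def predictive_values(tp, fp, fn, tn):
    """Return (PPV, NPV) from confusion-matrix counts."""
    ppv = tp / (tp + fp)  # share of predicted defaults that really default
    npv = tn / (tn + fn)  # share of predicted non-defaults that really pay
    return ppv, npv

# Hypothetical counts: 50 defaulters and 150 non-defaulters.
tp, fn = 30, 20    # defaulters flagged / missed
fp, tn = 30, 120   # non-defaulters flagged / cleared

base_rate = (tp + fn) / (tp + fp + fn + tn)
ppv, npv = predictive_values(tp, fp, fn, tn)
print(base_rate, ppv, round(npv, 3))  # 0.25 0.5 0.857
```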
1 x
30 x
.30 r
(<= .25 is incorrect)
0 x
1 x
1000 x
200.00 x
Test Incidences cannot be so small that they force a high false negative rate nor
so large that they force a high false positive rate. A perfect test will of course
have a Test Incidence equal to the Condition Incidence, but most classification
systems are focused on avoiding false negatives and have a higher Test Incidence
than Condition Incidence.
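To make the two incidences concrete, the sketch below computes them from confusion-matrix counts. The counts are hypothetical: 200 applicants, 50 of whom default, scored by a classifier that leans toward avoiding false negatives.

```python
def incidences(tp, fp, fn, tn):
    """Return (test incidence, condition incidence) from confusion-matrix counts."""
    total = tp + fp + fn + tn
    test_incidence = (tp + fp) / total       # share classified as "default"
    condition_incidence = (tp + fn) / total  # share that actually default
    return test_incidence, condition_incidence

# Hypothetical counts: few missed defaulters, at the price of more false alarms.
test_inc, cond_inc = incidences(tp=40, fp=50, fn=10, tn=100)
print(test_inc, cond_inc)  # 0.45 0.25

# A perfect test flags exactly the defaulters, so the two incidences match.
perfect_test_inc, perfect_cond_inc = incidences(tp=50, fp=0, fn=0, tn=150)
```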