
First Binary Classification Model

Data_Final Project.xlsx
You work for a bank as a business data analyst in the credit card risk-modeling
department. Your bank conducted a bold experiment three years ago: for a single day
it quietly issued credit cards to everyone who applied, regardless of their credit
risk, until the bank had issued 600 cards without screening applicants.

After three years, 150, or 25%, of those card recipients defaulted: they failed to
pay back at least some of the money they owed. However, the bank collected very
valuable proprietary data that it can now use to optimize its future card-issuing
process.

The bank initially collected six pieces of data about each person:

• Age

• Years at current employer

• Years at current address

• Income over the past year

• Current credit card debt, and

• Current automobile debt

In addition, the bank now has a binary outcome: default = 1, and no default = 0.

Your first assi!nent is to analy/e the data and create a binary classification
odel to forecast future defaults.

You will combine data from the above six inputs to output a single "score." Use the
Soldier Performance spreadsheet for a simple example of combining multiple inputs.

Forecasting Soldier Performance.xlsx


The relative rank-ordering of scores will determine the model's effectiveness. For
convenience -- in particular, so that you can use the AUC Calculator Spreadsheet --
you are asked to use a scale for your score that has a maximum < 3.5 and a minimum
> -3.5.

At first you are not told your bank's own best estimate of its cost per False
Negative (accepted applicant who becomes a defaulting customer) and False Positive
(rejected customer who would not have defaulted) classification.

Therefore, the best you can do is to design your model to maximize the Area Under
the ROC Curve, or AUC.

You are told that if your model is effective ("high enough" AUC, not defined
further) and "robust" (again not defined, but in general this means relatively
little decrease in AUC across multiple sets of new data) then it may be adopted by
the bank as its predictive model for default, to determine which future applicants
will be issued credit cards.

You are first given a "Training Set" of 200 out of the 600 people in the
experiment. The Data_For_Final_Project (below) has both the training set and test
set you will need.

Desi!n your odel usin! the 0rainin! 4et. 4tandardi/ed %ersions of the in"ut data
also "ro%ided for your con%enience. You ay cobine the six in"uts by addin! the
to& or subtractin! the fro& each other& takin! si"le ratios& etc. =xclude in"uts
that are not hel"ful and then ex"erient with how to cobine the ost inforati%e
in"uts.
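As a minimal sketch of what combining inputs can look like outside the spreadsheet, the Python below standardizes two inputs and subtracts one from the other. The toy numbers, the choice of inputs, and the helper names `standardize` and `clip` are assumptions for illustration only; they are not the project data and not a recommended model.

```python
import statistics

def standardize(values):
    """Return z-scores: (x - mean) / population standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(x - mean) / sd for x in values]

def clip(score, lo=-3.49, hi=3.49):
    """Keep a score strictly inside the (-3.5, 3.5) range the assignment asks for."""
    return max(lo, min(hi, score))

# Made-up toy rows (assumption), not taken from the project spreadsheet.
card_debt = [1.0, 4.0, 2.5, 0.5]
income = [40.0, 35.0, 60.0, 80.0]

z_debt = standardize(card_debt)
z_income = standardize(income)

# Hypothetical two-input model: more debt raises the score, more income lowers it.
scores = [clip(d - i) for d, i in zip(z_debt, z_income)]
print(scores)
```

Only the rank-ordering of the resulting scores matters for AUC, so the clipping step is there purely to respect the requested scale.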

Note that you will need some of your quiz answers again later, so please write them
down and keep track of them as you go along.

Question: What is your model? Give it as a function of two or more of the six
inputs. For example: (Age + Years at Current Address)/Income [not a great model!].

Your model should have at least two inputs.


1 r

?hat is your odel� s )3C on the 0rainin! 4et@ 3se two di!its to the ri!ht of the
decial "lace.
*, x
' x
.G r
9999Hess than .+ is not correct - you need to ake the hi!hest %alue the lowest by
di%idin! by -*.

.+ has no "redicti%e %alue.

.I or hi!her is too !ood to be trueE::::
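One way to sanity-check an AUC value is to compute it from its rank-ordering meaning: the fraction of (defaulter, non-defaulter) pairs in which the defaulter has the higher score, with ties counted as one half. The sketch below does this on made-up data. It is not how the AUC Calculator Spreadsheet is laid out, and `auc` is a hypothetical helper name. It also illustrates why multiplying every score by -1 turns an AUC below .5 into one above .5.

```python
def auc(scores, outcomes):
    """Share of (default, no-default) pairs where the defaulter scores higher.

    Ties count as half a win. outcomes uses default = 1, no default = 0.
    """
    pos = [s for s, y in zip(scores, outcomes) if y == 1]
    neg = [s for s, y in zip(scores, outcomes) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Made-up toy data (assumption), not the project data.
toy_scores = [2.1, 0.3, -1.0, 1.2, -0.4, -2.2]
toy_outcomes = [1, 0, 0, 1, 1, 0]

original = auc(toy_scores, toy_outcomes)
flipped = auc([-s for s in toy_scores], toy_outcomes)
print(round(original, 2), round(flipped, 2))
```

With no tied scores, the flipped AUC is exactly one minus the original, which is the reason a model that scores below .5 can be repaired by reversing its sign.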

1nitial )ssessent for <%er-fittin! 9testin! your odel on new data:

8ext test your odel& without chan!in! any "araeters& on the 0est 4et of ,((
additional a""licants. 4ee the 0est 4et s"readsheet. 1t is "art of the
Data_For_Final_Project 9below: and has both the trainin! and test set.

Data_Final Project.xlsx
Hint: Make and use a second copy of the AUC Calculator Spreadsheet so that you can
compare Test Set and Training Set results easily.

AUC_Calculator and Review of AUC Curve.xlsx


?hat is your odel� s new )3C on the 0est 4et@ Ai%e two di!its to the ri!ht of the
decial "lace.
*, x
6' x
.6 x
.J r
99995.+ is not %alid - ulti"ly by -*

.+ eans no "redicti%e %alue

7 .I( is too !ood to be trueE:::::

Findin! the Cost-Minii/in! 0hreshold for your Model


Now that you have, hopefully, developed your model to the point where it is
relatively "robust" across the training set and test set, your boss at the bank
finally gives you its current rough estimate of the bank's average costs for each
type of classification error.

Note that all bank models here include only profits and losses within three years
of when a card is issued, so the impact of out-years (years beyond 3) can be
ignored.

Cost Per False Negative: $5000

Cost Per False Positive: $2500

For the 600 individuals that were automatically given cards without being
classified, the total cost of the experiment turned out to be 25%*($5000)*600 or
$750,000. This is $1,250 per event.

Only models with lower cost per event than $1,250 should have any value.

Question: What is the threshold score on the Training Set data for your model that
minimizes Cost per Event? You will need this number to answer later questions.

Hint: Using the AUC Calculator Spreadsheet, identify which Column displays the same
cost-per-event (row 17) as the overall minimum cost-per-event shown in Cell [?]2. The
threshold is shown in row 10 of that Column. What the threshold means is that at
and above this number everything is classified as a "default."

20 x
1000 x
3.5 r
(Thresholds greater than 2.5 - may not be utilizing the full range for analysis

Thresholds less than -2.5 - may not be utilizing the full range for analysis)
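The threshold search described above can be sketched in a few lines of Python: classify every score at or above a candidate threshold as "default," price the resulting False Negatives and False Positives with the two costs given, and keep the cheapest threshold. The toy scores and the helper names `cost_per_event` and `best_threshold` are assumptions for illustration; the AUC Calculator Spreadsheet may organize the same search differently.

```python
COST_FN = 5000  # cost per False Negative, as given in the assignment
COST_FP = 2500  # cost per False Positive, as given in the assignment

def cost_per_event(scores, outcomes, threshold):
    """Classify score >= threshold as 'default' and average the error costs."""
    fn = sum(1 for s, y in zip(scores, outcomes) if y == 1 and s < threshold)
    fp = sum(1 for s, y in zip(scores, outcomes) if y == 0 and s >= threshold)
    return (fn * COST_FN + fp * COST_FP) / len(scores)

def best_threshold(scores, outcomes):
    """Return (threshold, cost per event) for the cheapest candidate threshold.

    Candidates are every observed score plus infinity (reject nobody).
    """
    candidates = sorted(set(scores)) + [float("inf")]
    costs = [(t, cost_per_event(scores, outcomes, t)) for t in candidates]
    return min(costs, key=lambda pair: pair[1])

# Made-up toy data with a 25% default rate (assumption), not the project data.
toy_scores = [2.1, 0.3, -1.0, 1.2, -0.4, -2.2, 1.5, -1.5]
toy_outcomes = [1, 0, 0, 1, 0, 0, 0, 0]

print(best_threshold(toy_scores, toy_outcomes))
print(cost_per_event(toy_scores, toy_outcomes, float("inf")))
```

The infinite threshold reproduces the "issue cards to everyone" case: with a 25% default rate it costs $1,250 per event, the benchmark any useful model has to beat.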

Findin! the Miniu Cost Per =%ent

>uestion# )!ain referrin! only to the 0rainin! 4et data& what is the o%erall
iniu cost-"er-e%ent@

int# You will need this nuber to answer later $uestions. 1f you used the )3C
Calculator& the o%erall iniu cost "er e%ent will be dis"layed in Cell ,.

8ote# for Coursera to inter"ret your answer correctly you ust !i%e your answer as
an inte!er - no decials or dollar si!n.

For =xa"le - enter KJ((.(( as NJ((N


600 r

Comparing the New Minimum Cost Per Event on Test Set Data

When you compared AUC for the Training and Test Sets, all that was necessary was to
look up the two different values in Cell G8. But to get an accurate measure of the
cost-savings using the original model on new data, you cannot automatically use
the new threshold that results in the overall lowest cost-per-event on the Test
Set.
Remember that your model is being tested for its ability to forecast - but the new
optimal threshold will be known only after the outcomes for the entire Test Set are
known.

All you can use is the model you developed on the Training Set data and the
threshold from the Training Set that you should have recorded when answering
Question 4.

Question: At that same threshold score (NOT the threshold score that would minimize
costs for the new Test Set, but the "old" threshold score that minimized costs on
the Training Set) what is the cost per event on the test set?

Hint: Using the AUC Calculator Spreadsheet previously provided, locate the column
on the Training Set data that has the lowest cost-per-event. That same column and
threshold in the Test Set copy of the AUC Calculator will have a new
cost-per-event, displayed in row 17. This is almost always higher than the minimum
cost-per-event on the Training Set, and also higher than what the minimal
cost-per-event would be on the Test Set, if one could know the new optimal
threshold in advance. This number is the actual cost per event when applying the
model-and-threshold developed with the Training Set to the new, Test Set data.

Note: for Coursera to interpret your answer correctly you must give your answer as
an integer - no decimals or dollar sign.

For Example - enter $800.00 as "800"


200 x
1 x
150 x
700.00 r
(If you find that your costs per event on the test set are much higher than
your costs per event on the training set, consider making your model simpler -
probably using fewer input variables - as it is probably still over-fitting the
training set data. Problems with over-fitting that were not obvious at the
ROC-curve stage may emerge when minimizing costs.)

Putting a Dollar Value on Your Model Plus the Data

)ssue your 0est 4et cost-"er-e%ent results fro >uestion ' are sustainable lon!
ter.

>uestion# ow uch oney does the bank sa%e& "er e%ent& usin! your odel and its
data-in"uts& instead of issuin! credit cards to e%eryone who asks@

Hint: the cost of issuing credit cards to everyone (no model, no forecast) has been
determined to be 25%*$5000 = $1,250 per event. Dollar value of the model-plus-data
is the difference between $1,250 and your number.

Note: for Coursera to interpret your answer correctly you must give your answer as
an integer - no decimals or dollar sign.

For Example - enter $800.00 as "800"


100 x

200 r
(<=$150 savings is a weak model

>$150 to <= $250 savings is an ok model


> $250 to <= $450 savings is a very good model

>$450 savings is an excellent model)
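The savings arithmetic can be written out as a short check. The default rate, False Negative cost, and applicant count come from the assignment; the $900 test-set cost per event is a hypothetical placeholder, and `savings_per_event` is an illustrative helper name.

```python
DEFAULT_RATE = 0.25   # 150 of 600 cardholders defaulted
COST_FN = 5000        # cost per False Negative, from the assignment
APPLICANTS = 600      # cards issued in the experiment

# With no model, every defaulter is a False Negative and nobody is rejected.
baseline_per_event = DEFAULT_RATE * COST_FN        # 1250.0
experiment_cost = baseline_per_event * APPLICANTS  # 750000.0

def savings_per_event(test_cost_per_event):
    """Dollar value of model-plus-data: baseline cost minus the model's cost."""
    return baseline_per_event - test_cost_per_event

hypothetical_test_cost = 900  # assumption, not a result from the project data
print(savings_per_event(hypothetical_test_cost))
```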

Payback Period for Your Model

Question: Given that it apparently cost the bank $750,000 to conduct the three-year
experiment, if the bank processes 1000 credit card applicants per day on average,
how many days will it take to ensure future savings will pay back the bank's
initial investment?

Give number rounded to the nearest day (integer value).

int# ulti"ly your answer to >uestion G - the cost sa%in!s "er a""licant - by *(((
to !et the sa%in!s "er day.

700000 x

3 r
(More than a week - poor

4-7 days - very good

2-3 days - excellent

1 day - too good to be true!)
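A minimal sketch of the payback calculation, assuming savings per applicant have already been worked out. The $750,000 experiment cost and 1000 applicants per day come from the assignment; the savings figures passed in are hypothetical, and `payback_days` is an illustrative helper name.

```python
import math

EXPERIMENT_COST = 750_000   # cost of the three-year experiment, from the assignment
APPLICANTS_PER_DAY = 1000   # average daily applicants, from the assignment

def payback_days(savings_per_applicant):
    """Days of savings needed to repay the experiment, rounded to the nearest day."""
    savings_per_day = savings_per_applicant * APPLICANTS_PER_DAY
    # floor(x + 0.5) rounds halves upward.
    return math.floor(EXPERIMENT_COST / savings_per_day + 0.5)

# Hypothetical savings figures (assumptions), not results from the project data.
print(payback_days(250))
print(payback_days(350))
```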

)ny odel that is reducin! uncertainty will ha%e a 0rue Positi%e ;ate...

...=$ual to the 0est 1ncidence 9 of outcoes classified as NdefaultN: x


...Hess than the 0est 1ncidence 9 of outcoes classified as NdefaultN: x
...Areater than the 0est 1ncidence 9 of outcoes classified as NdefaultN:

Given that the base rate of default in the population is 25%, any test that is
reducing uncertainty will have a Positive Predictive Value (PPV)...

...Equal to .25 x
...Less than .25 x
...Greater than .25

Given that the base rate of default in the population is 25%, any test that is
reducing uncertainty will have a Negative Predictive Value (NPV)...

...Equal to .75 x
...Less than .75 x
...Greater than .75

Confusion Matrix Metrics. To determine all performance metrics for a binary
classification, it is sufficient to have three values:

• The Condition Incidence (here the default rate of 25%)
• The probability of True Positives (the True Positive rate multiplied by the
Condition Incidence)
• The "Test Incidence" (also called "classification incidence" - the sum of the
probability of True Positives and False Positives)

These three values can all be obtained from the AUC Calculator Spreadsheet and
then used as inputs to the Information Gain Calculator Spreadsheet to determine all
other performance metrics.
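To see why three values are enough, the sketch below rebuilds the four confusion-matrix probabilities from them and derives the usual rates. It only illustrates the arithmetic and does not reproduce the Information Gain Calculator Spreadsheet; the function name `metrics` and the example inputs other than the 25% default rate are assumptions.

```python
def metrics(condition_incidence, p_true_positive, test_incidence):
    """Derive confusion-matrix rates from the three values named above."""
    p_tp = p_true_positive
    p_fp = test_incidence - p_tp          # classified default, did not default
    p_fn = condition_incidence - p_tp     # defaulted, not classified default
    p_tn = 1.0 - p_tp - p_fp - p_fn       # the four cells sum to 1
    return {
        "true_positive_rate": p_tp / condition_incidence,
        "false_positive_rate": p_fp / (1.0 - condition_incidence),
        "ppv": p_tp / test_incidence,
        "npv": p_tn / (1.0 - test_incidence),
    }

# 0.25 is the default rate from the assignment; 0.15 and 0.30 are hypothetical.
example = metrics(condition_incidence=0.25, p_true_positive=0.15, test_incidence=0.30)
for name, value in example.items():
    print(name, round(value, 3))
```

In this hypothetical case the True Positive Rate exceeds the Test Incidence, the PPV exceeds .25, and the NPV exceeds .75, consistent with a test that is reducing uncertainty.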

AUC_Calculator and Review of AUC Curve.xlsx


1nforation Aain Calculator.xlsx
>uestion# ?hat is your odel� s 0rue Positi%e ;ate@

4a%e this answer as it will be needed a!ain for Part 6 9>ui/ 6:

1 x
30 x
.30 r
(<= .25 is incorrect)

Question: What is your model's "test incidence"?

Save this answer as it will be needed again for Part 3 (Quiz 3)

0 x
1 x
1000 x
200.00 x
Test Incidences cannot be so small that they force a high false negative rate nor
so large that they force a high false positive rate. A perfect test will of course
have a Test Incidence equal to the Condition Incidence - but most classification
systems are focused on avoiding false negatives and have a higher Test Incidence
than Condition Incidence.
