First Binary Classification Model
Data_Final Project.xlsx
You work for a bank as a business data analyst in the credit card risk-modeling department. Your bank conducted a bold experiment three years ago: for a single day it quietly issued credit cards to everyone who applied, regardless of their credit risk, until the bank had issued 600 cards without screening applicants.
After three years, 150, or 25%, of those card recipients defaulted: they failed to pay back at least some of the money they owed. However, the bank collected very valuable proprietary data that it can now use to optimize its future card-issuing process.
The bank initially collected six pieces of data about each person:
• Age
• Years at current employer
• Years at current address
• Income over the past year
• Current credit card debt, and
• Current automobile debt
In addition, the bank now has a binary outcome: default = 1, and no default = 0. Your first assignment is to analyze the data and create a binary classification model to forecast future defaults.
You will combine data from the above six inputs to output a single "score." Use the Soldier Performance spreadsheet for a simple example of combining multiple inputs.
Forecasting Soldier Performance.xlsx
The relative rank-ordering of scores will determine the model's effectiveness. For convenience (in particular, so that you can use the AUC Calculator Spreadsheet) you are asked to use a scale for your score that has a maximum < 3.5 and a minimum > -3.5.
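A minimal sketch of how inputs might be combined into a single score that stays inside the requested range. The choice of inputs (card debt minus income), the sample numbers, and the helper names `standardize`, `combine`, and `rescale` are illustrative assumptions, not the course's model. Because only the rank order of scores matters, multiplying every score by the same positive constant is one safe way to pull large values inside (-3.5, 3.5).

```python
from statistics import mean, pstdev


def standardize(values):
    """Return z-scores: (x - mean) / population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(x - mu) / sigma for x in values]


def combine(z_debt, z_income):
    """Hypothetical score: standardized card debt minus standardized income."""
    return [d - i for d, i in zip(z_debt, z_income)]


def rescale(scores, limit=3.4):
    """Shrink all scores by one positive factor so that |score| <= limit.

    A positive factor never changes the rank order of the scores.
    """
    biggest = max(abs(s) for s in scores)
    if biggest <= limit:
        return list(scores)
    factor = limit / biggest
    return [s * factor for s in scores]


# Made-up inputs for five applicants (assumption, not the course data).
card_debt = [1.2, 0.4, 9.8, 3.1, 0.9]
income = [55.0, 72.0, 31.0, 48.0, 90.0]

scores = rescale(combine(standardize(card_debt), standardize(income)))
print([round(s, 2) for s in scores])
```

Subtracting income from debt is only one of many possible combinations; ratios or sums of other inputs fit the same pattern.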
At first you are not told what your bank's own best estimate is for its cost per False Negative (accepted applicant who becomes a defaulting customer) and False Positive (rejected customer who would not have defaulted) classification.
Therefore, the best you can do is to design your model to maximize the Area Under the ROC Curve, or AUC.
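For readers who want to check an AUC outside the spreadsheet, here is a minimal sketch. It uses the pairwise definition of AUC (the chance that a randomly chosen defaulter scores higher than a randomly chosen non-defaulter, with ties counted as half), which is a standard equivalent of the area under the ROC curve. It is not a description of how the AUC Calculator Spreadsheet works internally, and the sample scores are made up.

```python
def auc(scores, outcomes):
    """Pairwise AUC: P(defaulter score > non-defaulter score), ties count 0.5.

    outcomes uses the text's coding: 1 = default, 0 = no default.
    """
    defaulters = [s for s, y in zip(scores, outcomes) if y == 1]
    others = [s for s, y in zip(scores, outcomes) if y == 0]
    if not defaulters or not others:
        raise ValueError("need at least one default and one non-default")
    wins = 0.0
    for d in defaulters:
        for n in others:
            if d > n:
                wins += 1.0
            elif d == n:
                wins += 0.5
    return wins / (len(defaulters) * len(others))


# Made-up scores and outcomes for six applicants (assumption).
scores = [2.1, 0.3, -1.0, 1.5, -0.2, 0.8]
outcomes = [1, 0, 0, 0, 1, 1]
print(round(auc(scores, outcomes), 2))
```

The double loop is quadratic in the number of applicants, which is fine for a 200-row set.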
You are told that if your model is "effective" (high enough AUC, not defined further) and "robust" (again not defined, but in general this means relatively little decrease in AUC across multiple sets of new data) then it may be adopted by the bank as its predictive model for default, to determine which future applicants will be issued credit cards.
You are first given a "Training Set" of 200 out of the 600 people in the experiment. The Data_For_Final_Project (below) has both the training set and test set you will need.
Design your model using the Training Set. Standardized versions of the input data are also provided for your convenience. You may combine the six inputs by adding them to, or subtracting them from, each other, taking simple ratios, etc. Exclude inputs that are not helpful and then experiment with how to combine the most informative inputs.
Note that you will need some of your quiz answers again later, so please write them down and keep track of them as you go along.
Question: What is your model? Give it as a function of two or more of the six inputs. For example: (Age + Years at Current Address)/Income [not a great model!]. Your model should have at least two inputs.
1 r
What is your model's AUC on the Training Set? Use two digits to the right of the decimal place.
12 x 6 x .7 r
(Less than .5 is not correct - you need to make the highest value the lowest by dividing by -1. .5 has no predictive value. .9 or higher is too good to be true!)
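The advice to reverse the sign of a score whose AUC falls below .5 follows from a symmetry of the AUC. As a sketch, write $s_D$ for the score of a randomly chosen defaulter and $s_N$ for the score of a randomly chosen non-defaulter (both symbols are introduced here for illustration), and count ties as one half:

```latex
\begin{align*}
\mathrm{AUC}(s)  &= P(s_D > s_N) + \tfrac{1}{2}\,P(s_D = s_N) \\
\mathrm{AUC}(-s) &= P(s_D < s_N) + \tfrac{1}{2}\,P(s_D = s_N) = 1 - \mathrm{AUC}(s)
\end{align*}
```

So a score with an AUC of, say, .3 becomes a score with an AUC of .7 once every value is multiplied (or divided) by -1, and a score at exactly .5 gains nothing from the flip.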
Initial Assessment for Over-fitting (testing your model on new data)
Next test your model, without changing any parameters, on the Test Set of 200 additional applicants. See the Test Set spreadsheet. It is part of the Data_For_Final_Project (below), which has both the training and test set.
Data_Final Project.xlsx
Hint: Make and use a second copy of the AUC Calculator Spreadsheet so that you can compare Test Set and Training Set results easily.
AUC_Calculator and Review of AUC Curve.xlsx
What is your model's new AUC on the Test Set? Give two digits to the right of the decimal place.
12 x 36 x .3 x .8 r
(< .5 is not valid - multiply by -1. .5 means no predictive value. > .90 is too good to be true!)
Now that you have, hopefully, developed your model to the point where it is relatively robust across the training set and test set, your "boss" at the bank finally gives you its current rough estimate of the bank's average costs for each type of classification error.
Note that all bank models here include only profits and losses within three years of when a card is issued, so the impact of out-years (years beyond 3) can be ignored.
Cost Per False Negative: $5000
Cost Per False Positive: $2500
For the 600 individuals that were automatically given cards without being classified, the total cost of the experiment turned out to be 25% * $5000 * 600, or $750,000. This is $1,250 per event.
Only models with a lower cost per event than $1,250 should have any value.
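The baseline arithmetic can be reproduced in a few lines. This is a sketch using only the figures stated in the text; the function names are illustrative. The "reject everyone" figure is not given in the text and is included as an assumption-free consequence of the same numbers, for comparison.

```python
COST_FALSE_NEGATIVE = 5000  # dollars per accepted applicant who defaults
COST_FALSE_POSITIVE = 2500  # dollars per rejected applicant who would have repaid
DEFAULT_RATE = 0.25         # 150 of 600 cardholders defaulted
APPLICANTS = 600


def accept_everyone_cost_per_event(default_rate, cost_fn):
    """With no screening, every defaulter is a False Negative."""
    return default_rate * cost_fn


def reject_everyone_cost_per_event(default_rate, cost_fp):
    """With blanket rejection, every non-defaulter is a False Positive."""
    return (1 - default_rate) * cost_fp


per_event = accept_everyone_cost_per_event(DEFAULT_RATE, COST_FALSE_NEGATIVE)
total = per_event * APPLICANTS
reject_all = reject_everyone_cost_per_event(DEFAULT_RATE, COST_FALSE_POSITIVE)

print(per_event)   # cost per event when everyone gets a card
print(total)       # total cost of the 600-card experiment
print(reject_all)  # cost per event if nobody gets a card
```

Under these cost estimates, issuing cards to everyone is cheaper per event than rejecting everyone, which is why $1,250 is the figure a model has to beat.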
Question: What is the threshold score on the Training Set data for your model that minimizes Cost per Event? You will need this number to answer later questions.
Hint: Using the AUC Calculator Spreadsheet, identify which Column displays the same cost-per-event (row 17) as the overall minimum cost-per-event shown in Cell 2. The threshold is shown in row 10 of that Column. What the threshold means is that at and above this number everything is classified as a "default."
20 x 1000 x 3.5 r
(Thresholds greater than 2.5 - may not be utilizing the full range for analysis. Thresholds less than -2.5 - may not be utilizing the full range for analysis.)
Finding the Minimum Cost Per Event
Question: Again referring only to the Training Set data, what is the overall minimum cost-per-event?
Hint: You will need this number to answer later questions. If you used the AUC Calculator, the overall minimum cost per event will be displayed in Cell 2.
Note: for Coursera to interpret your answer correctly you must give your answer as an integer - no decimals or dollar sign.
For example - enter $800.00 as "800"
600 r
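The threshold search described in the last two questions can be sketched as a simple sweep. The rule that scores at or above the threshold are classified as "default" and the two cost figures come from the text; the 0.1-step grid of candidate thresholds, the function names, and the sample data are assumptions, and the spreadsheet may use a different grid.

```python
COST_FALSE_NEGATIVE = 5000  # accepted applicant who defaults
COST_FALSE_POSITIVE = 2500  # rejected applicant who would have repaid


def cost_per_event(scores, outcomes, threshold):
    """Average cost when every score >= threshold is classified as default."""
    total = 0
    for score, outcome in zip(scores, outcomes):
        flagged = score >= threshold
        if outcome == 1 and not flagged:
            total += COST_FALSE_NEGATIVE
        elif outcome == 0 and flagged:
            total += COST_FALSE_POSITIVE
    return total / len(scores)


def best_threshold(scores, outcomes, candidates):
    """Return (threshold, cost) with the lowest cost; ties go to the lowest threshold."""
    cost, threshold = min((cost_per_event(scores, outcomes, t), t) for t in candidates)
    return threshold, cost


# Candidate thresholds from -3.5 to 3.5 in steps of 0.1 (assumed grid).
grid = [round(-3.5 + 0.1 * k, 1) for k in range(71)]

# Made-up training scores and outcomes (assumption, not the course data).
scores = [2.1, 0.3, -1.0, 1.5, -0.2, 0.8, -0.6, 1.1]
outcomes = [1, 0, 0, 0, 1, 1, 0, 0]

threshold, cost = best_threshold(scores, outcomes, grid)
print(threshold, cost)
```

Both numbers should be recorded: the threshold is reused unchanged on new data, and the cost is the in-sample minimum.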
Comparing the New Minimum Cost Per Event on Test Set Data
When you compared AUC for the Training and Test Sets, all that was necessary was to look up the two different values in Cell G8. But to get an accurate measure of the cost savings from using the original model on new data, you cannot automatically use the new threshold that results in the overall lowest cost-per-event on the Test Set.
Remember that your model is being tested for its ability to forecast - but the new optimal threshold will be known only after the outcomes for the entire Test Set are known.
All you can use is the model you developed on the Training Set data and the threshold from the Training Set that you should have recorded when answering Question 4.
Question: At that same threshold score (NOT the threshold score that would minimize costs for the new Test Set, but the "old" threshold score that minimized costs on the Training Set) what is the cost per event on the test set?
Hint: Using the AUC Calculator Spreadsheet previously provided, locate the column on the Training Set data that has the lowest cost-per-event. That same column and threshold in the Test Set copy of the AUC Calculator will have a new cost-per-event, displayed in row 17. This is almost always higher than the minimum cost-per-event on the Training Set, and also higher than what the minimal cost-per-event would be on the Test Set, if one could know the new optimal threshold in advance. This number is the actual cost per event when applying the model-and-threshold developed with the Training Set to the new Test Set data.
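A small sketch of the difference between the frozen training threshold and a threshold chosen with hindsight on the test set. The cost figures and the "at or above the threshold means default" rule come from the text; the data, the 0.5-step candidate grid, and the variable names are made up for illustration.

```python
def cost_per_event(scores, outcomes, threshold, cost_fn=5000, cost_fp=2500):
    """Average cost when every score >= threshold is classified as default."""
    false_neg = sum(1 for s, y in zip(scores, outcomes) if y == 1 and s < threshold)
    false_pos = sum(1 for s, y in zip(scores, outcomes) if y == 0 and s >= threshold)
    return (false_neg * cost_fn + false_pos * cost_fp) / len(scores)


# Made-up data (assumption): 1 = default, 0 = no default.
train_scores = [1.8, 0.9, 0.4, -0.3, -1.2, -2.0]
train_outcomes = [1, 1, 0, 0, 0, 0]
test_scores = [2.2, 0.2, 0.7, -0.5, -1.1, -0.8]
test_outcomes = [1, 1, 0, 0, 0, 0]

# Candidate thresholds from -3.5 to 3.5 in steps of 0.5 (assumed grid).
candidates = [-3.5 + 0.5 * k for k in range(15)]

# Step 1: choose the threshold using the Training Set only.
train_threshold = min(
    candidates, key=lambda t: cost_per_event(train_scores, train_outcomes, t)
)

# Step 2: apply that same threshold, unchanged, to the Test Set.
frozen_cost = cost_per_event(test_scores, test_outcomes, train_threshold)

# For comparison only: the best cost available with hindsight on the Test Set.
hindsight_cost = min(
    cost_per_event(test_scores, test_outcomes, t) for t in candidates
)

print(train_threshold, frozen_cost, round(hindsight_cost, 2))
```

The frozen-threshold cost is the honest out-of-sample figure; the hindsight figure can never be higher, because it is the minimum over the same candidates.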
Note: for Coursera to interpret your answer correctly you must give your answer as an integer - no decimals or dollar sign.
For example - enter $800.00 as "800"
200 x 1 x 150 x 700.00 r
(If you find that your costs per event on the test set are much higher than your costs per event on the training set, consider making your model simpler - probably using fewer input variables - as it is probably still over-fitting the training set data. Problems with over-fitting that were not obvious at the ROC-curve stage may emerge when minimizing costs.)
Putting a Dollar Value on Your Model Plus the Data
Assume your Test Set cost-per-event results from Question 6 are sustainable long term.
Question: How much money does the bank save, per event, using your model and its data-inputs, instead of issuing credit cards to everyone who asks?
Hint: the cost of issuing credit cards to everyone (no model, no forecast) has been determined to be 25% * $5000 = $1,250 per event. The dollar value of the model-plus-data is the difference between $1,250 and your number.
Note: for Coursera to interpret your answer correctly you must give your answer as an integer - no decimals or dollar sign.
For example - enter $800.00 as "800"
100 x 200 r
(<= $150 savings is a weak model. > $150 to <= $250 savings is an OK model. > $250 to <= $450 savings is a very good model. > $450 savings is an excellent model.)
Payback Period for Your Model
Question: Given that it apparently cost the bank $750,000 to conduct the three-year experiment, if the bank processes 1000 credit card applicants per day on average, how many days will it take to ensure future savings will pay back the bank's initial investment?
Give the number rounded to the nearest day (integer value).
Hint: multiply your answer to Question 7 - the cost savings per applicant - by 1000 to get the savings per day.
700000 x 3 r
(More than a week - poor. 4-7 days - very good. 2-3 days - excellent. 1 day - too good to be true!)
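The savings and payback arithmetic can be sketched as follows. The $1,250 baseline, the $750,000 experiment cost, and the 1000 applicants per day come from the text; the test-set cost per event of $1,000 is a hypothetical value chosen only to make the example concrete, and the function names are illustrative.

```python
import math

BASELINE_COST_PER_EVENT = 1250  # issuing cards to everyone
EXPERIMENT_COST = 750_000       # cost of the three-year experiment
APPLICANTS_PER_DAY = 1000


def savings_per_event(test_cost_per_event):
    """Dollar value of the model plus data, per applicant."""
    return BASELINE_COST_PER_EVENT - test_cost_per_event


def payback_days(savings):
    """Days of applicants needed to recover the experiment cost, to the nearest day."""
    if savings <= 0:
        raise ValueError("a model that saves nothing never pays back")
    days = EXPERIMENT_COST / (savings * APPLICANTS_PER_DAY)
    # floor(x + 0.5) rounds halves up, unlike Python's round(), which rounds halves to even.
    return math.floor(days + 0.5)


example_test_cost = 1000  # hypothetical test-set cost per event
saved = savings_per_event(example_test_cost)
print(saved, payback_days(saved))
```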
Any model that is reducing uncertainty will have a True Positive Rate...
...Equal to the Test Incidence (% of outcomes classified as "default") x
...Less than the Test Incidence (% of outcomes classified as "default") x
...Greater than the Test Incidence (% of outcomes classified as "default")
Given that the base rate of default in the population is 25%, any test that is reducing uncertainty will have a Positive Predictive Value (PPV)...
...Equal to .25 x
...Less than .25 x
...Greater than .25
Given that the base rate of default in the population is 25%, any test that is reducing uncertainty will have a Negative Predictive Value (NPV)...
...Equal to .75 x
...Less than .75 x
...Greater than .75
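One way to see how these three statements are linked is to write them in terms of the joint probability of a True Positive. The abbreviations CI (Condition Incidence), TI (Test Incidence), and TPR (True Positive Rate) are shorthand introduced here, and the sketch assumes $0 < \mathrm{CI} < 1$, $0 < \mathrm{TI} < 1$, and a score oriented so that higher values mean higher risk:

```latex
\begin{align*}
P(\mathrm{TP}) &= \mathrm{TPR}\cdot\mathrm{CI} \\
\mathrm{PPV} &= \frac{P(\mathrm{TP})}{\mathrm{TI}} = \mathrm{CI}\cdot\frac{\mathrm{TPR}}{\mathrm{TI}}
  &&\Rightarrow\quad \mathrm{PPV} > \mathrm{CI} \iff \mathrm{TPR} > \mathrm{TI} \\
\mathrm{NPV} &= \frac{1 - \mathrm{CI} - \mathrm{TI} + P(\mathrm{TP})}{1 - \mathrm{TI}}
  &&\Rightarrow\quad \mathrm{NPV} > 1 - \mathrm{CI} \iff P(\mathrm{TP}) > \mathrm{CI}\cdot\mathrm{TI} \iff \mathrm{TPR} > \mathrm{TI}
\end{align*}
```

With CI = .25, the three conditions TPR > TI, PPV > .25, and NPV > .75 therefore hold together or fail together.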
Confusion Matrix Metrics. To determine all performance metrics for a binary classification, it is sufficient to have three values:
• The Condition Incidence (here the default rate of 25%)
• The probability of True Positives (the True Positive rate multiplied by the Condition Incidence)
• The Test Incidence (also called "classification incidence" - the sum of the probability of True Positives and False Positives)
These three values can all be obtained from the AUC Calculator Spreadsheet and then used as inputs to the Information Gain Calculator Spreadsheet to determine all other performance metrics.
AUC_Calculator and Review of AUC Curve.xlsx
Information Gain Calculator.xlsx
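As a sketch of why three values are enough, the function below rebuilds the four cells of the confusion matrix from the Condition Incidence, the probability of True Positives, and the Test Incidence, and then derives a few common metrics. It is not a description of the Information Gain Calculator Spreadsheet; the function name, the selection of metrics, and the example inputs (.15 and .30) are assumptions.

```python
def confusion_metrics(condition_incidence, p_true_positive, test_incidence):
    """Derive confusion-matrix metrics from three probabilities.

    condition_incidence: P(actual default)
    p_true_positive:     P(classified default AND actual default)
    test_incidence:      P(classified default)
    """
    if not 0 < condition_incidence < 1 or not 0 < test_incidence < 1:
        raise ValueError("incidences must lie strictly between 0 and 1")
    p_tp = p_true_positive
    p_fp = test_incidence - p_tp        # classified default, actually repaid
    p_fn = condition_incidence - p_tp   # classified no default, actually defaulted
    p_tn = 1.0 - p_tp - p_fp - p_fn     # classified no default, actually repaid
    if min(p_tp, p_fp, p_fn, p_tn) < -1e-12:
        raise ValueError("inputs are inconsistent: a cell probability is negative")
    return {
        "true_positive_rate": p_tp / condition_incidence,
        "false_positive_rate": p_fp / (1.0 - condition_incidence),
        "ppv": p_tp / test_incidence,
        "npv": p_tn / (1.0 - test_incidence),
        "accuracy": p_tp + p_tn,
    }


# Default rate of 25% is from the text; the other two inputs are hypothetical.
metrics = confusion_metrics(0.25, 0.15, 0.30)
for name, value in metrics.items():
    print(name, round(value, 3))
```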
Question: What is your model's True Positive Rate?
Save this answer as it will be needed again for Part 3 (Quiz 3)
1 x 30 x .30 r
(<= .25 is incorrect)
Question: What is your model's "test incidence"?
Save this answer as it will be needed again for Part 3 (Quiz 3)
0 x 1 x 1000 x 200.00 x
Test Incidences cannot be so small that they force a high false negative rate nor so large that they force a high false positive rate. A "perfect" test will of course have a Test Incidence equal to the Condition Incidence - but most classification systems are focused on avoiding false negatives and have a higher Test Incidence than Condition Incidence.