Stepwise logistic regression Assessing the fit of the Model
Assistant Professor Nikom Thanomsieng, Department of Biostatistics and Demography, Faculty of Public Health, Khon Kaen University
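Logistic function

$$f(z) = \frac{1}{1 + e^{-z}}, \qquad -\infty < z < \infty$$

$$f(-\infty) = \frac{1}{1 + e^{\infty}} = 0, \qquad f(0) = \frac{1}{1 + e^{0}} = \frac{1}{2}, \qquad f(+\infty) = \frac{1}{1 + e^{-\infty}} = 1$$

[Figure: S-shaped curve of f(z) plotted against z, rising from 0 through 1/2 at z = 0 to 1]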
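The limits of the logistic function can be checked numerically. The following is a minimal sketch in Python (the function name `logistic` is an illustrative choice, not part of the original slides); it evaluates f(z) in a way that avoids overflow for large |z|.

```python
import math


def logistic(z: float) -> float:
    """Logistic function f(z) = 1 / (1 + e^(-z)), evaluated without overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)  # for negative z, e^z is small and safe to compute
    return ez / (1.0 + ez)


# f(z) approaches 0 as z -> -infinity, equals 1/2 at z = 0, approaches 1 as z -> +infinity
for z in (-30.0, -2.0, 0.0, 2.0, 30.0):
    print(f"f({z:+.0f}) = {logistic(z):.6f}")
```

The symmetry f(z) + f(-z) = 1 is why the curve passes through 1/2 at z = 0.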
Model Building in Logistic Regression Analysis: Stepwise logistic regression
Variables can be selected in several ways, for example:
1. Examine the p-values of the important variables.
2. Compare a reduced model with the full model.
The full model contains every variable (at the current step of the analysis); the reduced model drops one variable.
For example:
coro = b0 + b1(SYSBP) + b2(DM) + b3(LDL)   -> full
coro = b0 + b1(SYSBP) + b2(DM)             -> reduced
Stepwise logistic regression: selection based on the p-value of the test statistic for each important variable.
1. Set the p-value for entering a variable into the equation (Pe) and the p-value for removing a variable from the equation (Pr).
2. Enter significant variables into the equation first, starting with the smallest p-value, provided p-value < Pe.
3. Compute the test statistics and select variables to remove: those with p-value > Pr.
4. Repeat steps 2-3 until no variable can be entered or removed.
Hosmer & Lemeshow (2000) recommend a p-value for entry (Pe) of .15-.20 and a p-value for removal (Pr) > Pe.
Code Sheet for the Variables in the Low Birth Weight Study

Variable  Description                                            Codes/Values                         Name
1         Identification Code                                    ID Number                            ID
2         Low Birth Weight                                       1 = BWT <= 2500 g, 0 = BWT > 2500 g  LOW
3         Age of Mother                                          Years                                AGE
4         Weight of Mother at Last Menstrual Period              Pounds                               LWT
5         Race                                                   1 = White, 2 = Black, 3 = Other      RACE
6         Smoking Status During Pregnancy                        0 = No, 1 = Yes                      SMOKE
7         History of Premature Labor                             0 = None, 1 = One, 2 = Two, etc.     PTL
8         History of Hypertension                                0 = No, 1 = Yes                      HT
9         Presence of Uterine Irritability                       0 = No, 1 = Yes                      UI
10        Number of Physician Visits During the First Trimester  0 = None, 1 = One, 2 = Two, etc.     FTV
11        Birth Weight                                           Grams                                BWT
1. Set the p-value for entering a variable into the equation (Pe) = .20 and the p-value for removing a variable from the equation (Pr) = .25.
2. Consider the variable with the smallest p-value first, subject to p-value < Pe.

. logit low lwt
Iteration 0:   log likelihood =   -117.336
Iteration 1:   log likelihood = -114.37209
Iteration 2:   log likelihood = -114.34534
Iteration 3:   log likelihood = -114.34533

Logistic regression                               Number of obs   =        189
                                                  LR chi2(1)      =       5.98
                                                  Prob > chi2     =     0.0145
Log likelihood = -114.34533                       Pseudo R2       =     0.0255
------------------------------------------------------------------------------
         low |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         lwt |  -.0140583   .0061696    -2.28   0.023    -.0261504   -.0019661
       _cons |   .9983135   .7852908     1.27   0.204    -.5408283    2.537455
------------------------------------------------------------------------------

. logit low age
Iteration 0:   log likelihood =   -117.336
Iteration 1:   log likelihood = -115.96259
Iteration 2:   log likelihood = -115.95598
Iteration 3:   log likelihood = -115.95598

Logistic regression                               Number of obs   =        189
                                                  LR chi2(1)      =       2.76
                                                  Prob > chi2     =     0.0966
Log likelihood = -115.95598                       Pseudo R2       =     0.0118
------------------------------------------------------------------------------
         low |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |  -.0511529   .0315138    -1.62   0.105    -.1129188    .0106129
       _cons |   .3845819   .7321251     0.53   0.599    -1.050357    1.819521
------------------------------------------------------------------------------

. logit low Irace_D2
Iteration 0:   log likelihood =   -117.336
Iteration 1:   log likelihood = -116.51366
Iteration 2:   log likelihood = -116.50935
Iteration 3:   log likelihood = -116.50935

Logistic regression                               Number of obs   =        189
                                                  LR chi2(1)      =       1.65
                                                  Prob > chi2     =     0.1985
Log likelihood = -116.50935                       Pseudo R2       =     0.0070
------------------------------------------------------------------------------
         low |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    Irace_D2 |   .5635762   .4325561     1.30   0.193    -.2842181     1.41137
       _cons |  -.8737311     .17184    -5.08   0.000    -1.210531   -.5369309
------------------------------------------------------------------------------
. logit low Irace_D3
Iteration 0:   log likelihood =   -117.336
Iteration 1:   log likelihood = -116.45064
Iteration 2:   log likelihood = -116.44906
Iteration 3:   log likelihood = -116.44906

Logistic regression                               Number of obs   =        189
                                                  LR chi2(1)      =       1.77
                                                  Prob > chi2     =     0.1829
Log likelihood = -116.44906                       Pseudo R2       =     0.0076
------------------------------------------------------------------------------
         low |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    Irace_D3 |   .4321825   .3233959     1.34   0.181    -.2016619    1.066027
       _cons |  -.9509763   .2019292    -4.71   0.000     -1.34675   -.5552023
------------------------------------------------------------------------------

. logit low ftv
Iteration 0:   log likelihood =   -117.336
Iteration 1:   log likelihood = -116.95056
Iteration 2:   log likelihood = -116.94943
Iteration 3:   log likelihood = -116.94943

Logistic regression                               Number of obs   =        189
                                                  LR chi2(1)      =       0.77
                                                  Prob > chi2     =     0.3792
Log likelihood = -116.94943                       Pseudo R2       =     0.0033
------------------------------------------------------------------------------
         low |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         ftv |  -.1351199   .1566986    -0.86   0.389    -.4422435    .1720037
       _cons |  -.6867585   .1948119    -3.53   0.000    -1.068583   -.3049343
------------------------------------------------------------------------------
3. Compute the test statistics and check the p-value for removing variables (p-value > Pr).
-lwt has p-value = .023 < Pr, so it stays in the model.

. xi: logit low lwt
Iteration 0:   log likelihood =   -117.336
Iteration 1:   log likelihood = -114.41626
Iteration 2:   log likelihood = -114.34546
Iteration 3:   log likelihood = -114.34533

Logit estimates                                   Number of obs   =        189
                                                  LR chi2(1)      =       5.98
                                                  Prob > chi2     =     0.0145
Log likelihood = -114.34533                       Pseudo R2       =     0.0255
------------------------------------------------------------------------------
         low |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         lwt |  -.0140583   .0061696    -2.28   0.023    -.0261504   -.0019661
       _cons |   .9983143   .7852889     1.27   0.204    -.5408235    2.537452
------------------------------------------------------------------------------
4. Consider the variable with the smallest p-value first, subject to p-value < Pe.
-age: p-value = .218 > Pe, so it is not entered.

. xi: logit low lwt age
Iteration 0:   log likelihood =   -117.336
Iteration 1:   log likelihood = -113.60317
Iteration 2:   log likelihood =  -113.5617
Iteration 3:   log likelihood = -113.56169

Logistic regression                               Number of obs   =        189
                                                  LR chi2(2)      =       7.55
                                                  Prob > chi2     =     0.0230
Log likelihood = -113.56169                       Pseudo R2       =     0.0322
------------------------------------------------------------------------------
         low |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         lwt |  -.0127754   .0062112    -2.06   0.040    -.0249492   -.0006016
         age |  -.0397879   .0322873    -1.23   0.218    -.1030699     .023494
       _cons |   1.748772   .9970965     1.75   0.079    -.2055009    3.703046
------------------------------------------------------------------------------

. xi: logit low lwt ftv
Iteration 0:   log likelihood =   -117.336
Iteration 1:   log likelihood =  -114.1942
Iteration 2:   log likelihood = -114.16288
Iteration 3:   log likelihood = -114.16287

Logistic regression                               Number of obs   =        189
                                                  LR chi2(2)      =       6.35
                                                  Prob > chi2     =     0.0419
Log likelihood = -114.16287                       Pseudo R2       =     0.0270
------------------------------------------------------------------------------
         low |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         lwt |  -.0137064   .0062012    -2.21   0.027    -.0258605   -.0015523
         ftv |  -.0977257   .1635011    -0.60   0.550    -.4181819    .2227306
       _cons |    1.02784   .7888371     1.30   0.193    -.5182523    2.573932
------------------------------------------------------------------------------
4. Consider the variable with the smallest p-value first, subject to p-value < Pe.
-ftv: p-value = .550 > Pe, so it is not entered.

. xi: logit low lwt i.race
i.race            _Irace_1-3          (naturally coded; _Irace_1 omitted)
Iteration 0:   log likelihood =   -117.336
Iteration 1:   log likelihood = -111.73378
Iteration 2:   log likelihood = -111.62959
Iteration 3:   log likelihood = -111.62955
Iteration 4:   log likelihood = -111.62955

Logistic regression                               Number of obs   =        189
                                                  LR chi2(3)      =      11.41
                                                  Prob > chi2     =     0.0097
Log likelihood = -111.62955                       Pseudo R2       =     0.0486
------------------------------------------------------------------------------
         low |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         lwt |  -.0152231   .0064394    -2.36   0.018     -.027844   -.0026022
    _Irace_2 |   1.081066   .4880522     2.22   0.027     .1245015    2.037631
    _Irace_3 |   .4806033   .3566737     1.35   0.178    -.2184644    1.179671
       _cons |   .8057535   .8451667     0.95   0.340    -.8507428     2.46225
------------------------------------------------------------------------------

. test _Irace_2 _Irace_3
 ( 1)  [low]_Irace_2 = 0
 ( 2)  [low]_Irace_3 = 0
           chi2(  2) =    5.40
         Prob > chi2 =    0.0671
4. Compute the test statistics and check the p-value for entering variables (p-value < Pe).
-_Irace_2 has p-value = .027 and _Irace_3 has p-value = .178; the joint test of the two dummy variables gives p-value = 0.0671 < Pe, so race enters the model.
5. Compute the test statistics and check the p-value for removing variables (p-value > Pr).
-_Irace_2 has p-value = .027 < Pr, so it stays in the model.
-_Irace_3 has p-value = .178 < Pr, so it stays in the model.
6. Consider the variable with the smallest p-value first, subject to p-value < Pe.
-age has p-value = .443 > Pe, so it is not entered.

. xi: logit low lwt i.race age
i.race            _Irace_1-3          (naturally coded; _Irace_1 omitted)
Iteration 0:   log likelihood =   -117.336
…
Iteration 4:   log likelihood = -111.33032

Logistic regression                               Number of obs   =        189
                                                  LR chi2(4)      =      12.01
                                                  Prob > chi2     =     0.0173
Log likelihood = -111.33032                       Pseudo R2       =     0.0512
------------------------------------------------------------------------------
         low |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         lwt |  -.0143532   .0065228    -2.20   0.028    -.0271378   -.0015687
    _Irace_2 |   1.003822   .4980145     2.02   0.044     .0277315    1.979912
    _Irace_3 |   .4434608   .3602574     1.23   0.218    -.2626307    1.149552
         age |  -.0255238   .0332521    -0.77   0.443    -.0906967    .0396492
       _cons |   1.306741   1.069786     1.22   0.222     -.790001    3.403483
------------------------------------------------------------------------------

. xi: logit low lwt i.race ftv
i.race            _Irace_1-3          (naturally coded; _Irace_1 omitted)
Iteration 0:   log likelihood =   -117.336
Iteration 1:   log likelihood =  -111.6474
Iteration 2:   log likelihood = -111.53946
Iteration 3:   log likelihood = -111.53941
Iteration 4:   log likelihood = -111.53941

Logistic regression                               Number of obs   =        189
                                                  LR chi2(4)      =      11.59
                                                  Prob > chi2     =     0.0206
Log likelihood = -111.53941                       Pseudo R2       =     0.0494
------------------------------------------------------------------------------
         low |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         lwt |  -.0149921   .0064719    -2.32   0.021    -.0276767   -.0023075
    _Irace_2 |   1.072486   .4887825     2.19   0.028     .1144895    2.030482
    _Irace_3 |   .4620372    .359807     1.28   0.199    -.2431715    1.167246
         ftv |  -.0695769   .1650308    -0.42   0.673    -.3930312    .2538775
       _cons |   .8378754   .8518404     0.98   0.325    -.8317011    2.507452
------------------------------------------------------------------------------
6. Consider the variable with the smallest p-value first, subject to p-value < Pe.
-ftv: p-value = .673 > Pe, so it is not entered.
7. Compute the test statistics and check the p-value for entering variables into the model (p-value < Pe).
-age has p-value = .443 and ftv has p-value = .673, both > Pe, so no further variables are entered.
Therefore, the model that has been built consists of lwt and race.
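The manual steps above can be sketched as a small forward-stepwise loop. This is an illustrative Python sketch only, not Stata's `sw` implementation: the p-values are copied from the Stata output in this section into a lookup table (`P_VALUES`, a hypothetical name) instead of being re-estimated, and `race` stands for the joint test of `_Irace_2` and `_Irace_3`. The source does not report a joint p-value for race in the empty model, so the entry marked as an assumption uses the smaller of the two single-dummy p-values as a stand-in.

```python
from typing import Dict, FrozenSet, List, Tuple

Key = Tuple[FrozenSet[str], str]

# p-value of a term given the terms currently in the model.
# If the term is not in the model, this is its entry p-value;
# if it is in the model, this is its p-value for the removal check.
P_VALUES: Dict[Key, float] = {
    (frozenset(), "lwt"): 0.023,
    (frozenset(), "age"): 0.105,
    (frozenset(), "race"): 0.181,  # ASSUMPTION: stand-in, joint test not shown in the source
    (frozenset(), "ftv"): 0.389,
    (frozenset({"lwt"}), "lwt"): 0.023,
    (frozenset({"lwt"}), "age"): 0.218,
    (frozenset({"lwt"}), "race"): 0.0671,
    (frozenset({"lwt"}), "ftv"): 0.550,
    (frozenset({"lwt", "race"}), "lwt"): 0.018,
    (frozenset({"lwt", "race"}), "race"): 0.0671,
    (frozenset({"lwt", "race"}), "age"): 0.443,
    (frozenset({"lwt", "race"}), "ftv"): 0.673,
}


def forward_stepwise(terms: List[str], p_values: Dict[Key, float],
                     pe: float = 0.20, pr: float = 0.25) -> List[str]:
    """Forward selection with a removal check after each entry."""
    model: List[str] = []
    while True:
        current = frozenset(model)
        entry = {t: p_values[(current, t)] for t in terms if t not in current}
        if not entry:
            break
        best = min(entry, key=entry.get)  # smallest p-value is considered first
        if entry[best] >= pe:             # nothing qualifies for entry: stop
            break
        model.append(best)
        while model:                      # removal check: drop terms with p > Pr
            current = frozenset(model)
            worst = max(model, key=lambda t: p_values[(current, t)])
            if p_values[(current, worst)] <= pr:
                break
            model.remove(worst)
    return model


TERMS = ["lwt", "age", "race", "ftv"]
print(forward_stepwise(TERMS, P_VALUES))                    # Pe = .20, Pr = .25
print(forward_stepwise(TERMS, P_VALUES, pe=0.05, pr=0.10))  # stricter thresholds
```

With Pe = .20 and Pr = .25 the loop selects lwt and then race; with Pe = .05 and Pr = .10 only lwt enters, because the joint p-value for race (.0671) is above .05.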
. xi: sw logit low lwt age (i.race) ftv, pr(.25) pe(.20) forward
i.race            _Irace_1-3          (naturally coded; _Irace_1 omitted)
                      begin with empty model
p = 0.0227 <  0.2000  adding  lwt
p = 0.0671 <  0.2000  adding  _Irace_2 _Irace_3

Logistic regression                               Number of obs   =        189
                                                  LR chi2(3)      =      11.41
                                                  Prob > chi2     =     0.0097
Log likelihood = -111.62955                       Pseudo R2       =     0.0486
------------------------------------------------------------------------------
         low |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         lwt |  -.0152231   .0064394    -2.36   0.018     -.027844   -.0026022
    _Irace_2 |   1.081066   .4880522     2.22   0.027     .1245015    2.037631
    _Irace_3 |   .4806033   .3566737     1.35   0.178    -.2184644    1.179671
       _cons |   .8057535   .8451667     0.95   0.340    -.8507428     2.46225
------------------------------------------------------------------------------
Stepwise logistic regression in STATA
-p-value for entry (Pe) = .20
-p-value for removal (Pr) = .25
Setting the p-value for entry too high or too low:
-Using a more traditional level (.05) often fails to identify variables known to be important.
-A higher level has the disadvantage of including variables that are of questionable importance at the model building stage.
(Original: Mickey & Greenland, 1989, pp. 125-137; cited in Hosmer & Lemeshow (2000), p. 95)
. xi: sw logit low age lwt i.race ftv, pr(.10) pe(.05) forward
i.race            _Irace_1-3          (naturally coded; _Irace_1 omitted)
                      begin with empty model
p = 0.0227 <  0.0500  adding  lwt

Logit estimates                                   Number of obs   =        189
                                                  LR chi2(1)      =       5.98
                                                  Prob > chi2     =     0.0145
Log likelihood = -114.34533                       Pseudo R2       =     0.0255
------------------------------------------------------------------------------
         low |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         lwt |  -.0140583   .0061696    -2.28   0.023    -.0261504   -.0019661
       _cons |   .9983143   .7852889     1.27   0.204    -.5408235    2.537452
------------------------------------------------------------------------------
The run above is an example with p-value for entry = .05 and p-value for removal = .10.
Stepwise logistic regression
-Fitting by maximum likelihood at every step of a stepwise logistic regression is time-consuming for large data sets.
-Software therefore relies on a test statistic for each variable, for example:
SAS -uses the score test (Pe) and the Wald test (Pr)
STATA -uses the Wald test (Pe, Pr)
SPSS -uses the score test (Pe) and the LR test (Pr)
Stepwise logistic regression: comparing a reduced model with the full model
-Reduce the model so that only the significant terms remain,
-except for discrete variables, or variables whose higher-order interaction is significant.
-Compare the reduced model with the full model.
-If the likelihood ratio test (G) shows no difference between the reduced model and the full model, the reduced model is as good as the full model. (Left for students to explore.)
. xi: logit low age lwt i.race ftv
i.race            _Irace_1-3          (naturally coded; _Irace_1 omitted)
Iteration 0:   log likelihood =   -117.336
Iteration 1:   log likelihood = -111.41656
Iteration 2:   log likelihood = -111.28677
Iteration 3:   log likelihood = -111.28645

Logit estimates                                   Number of obs   =        189
                                                  LR chi2(5)      =      12.10
                                                  Prob > chi2     =     0.0335
Log likelihood = -111.28645                       Pseudo R2       =     0.0516
------------------------------------------------------------------------------
         low |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   -.023823   .0337295    -0.71   0.480    -.0899317    .0422857
         lwt |  -.0142446   .0065407    -2.18   0.029    -.0270641   -.0014251
    _Irace_2 |   1.003898   .4978579     2.02   0.044     .0281143    1.979681
    _Irace_3 |   .4331084   .3622397     1.20   0.232    -.2768684    1.143085
         ftv |  -.0493083   .1672386    -0.29   0.768    -.3770899    .2784733
       _cons |   1.295366   1.071439     1.21   0.227    -.8046157    3.395347
------------------------------------------------------------------------------

. xi: logit low age lwt i.race ftv, or
i.race            _Irace_1-3          (naturally coded; _Irace_1 omitted)
Iteration 0:   log likelihood =   -117.336
Iteration 1:   log likelihood = -111.41656
Iteration 2:   log likelihood = -111.28677
Iteration 3:   log likelihood = -111.28645

Logit estimates                                   Number of obs   =        189
                                                  LR chi2(5)      =      12.10
                                                  Prob > chi2     =     0.0335
Log likelihood = -111.28645                       Pseudo R2       =     0.0516
------------------------------------------------------------------------------
         low | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .9764586   .0329355    -0.71   0.480     .9139936    1.043193
         lwt |   .9858564   .0064482    -2.18   0.029     .9732989    .9985759
    _Irace_2 |   2.728898   1.358603     2.02   0.044     1.028513    7.240436
    _Irace_3 |   1.542043   .5585894     1.20   0.232     .7581543     3.13643
         ftv |   .9518876   .1591923    -0.29   0.768     .6858544    1.321111
------------------------------------------------------------------------------
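The reduced-versus-full comparison can be worked through with the log likelihoods reported in this section. The sketch below, in Python, treats the model with age, lwt, race, and ftv as the full model and the model with lwt and race as the reduced model; that pairing is an assumption made here for illustration. With 2 degrees of freedom the chi-square upper-tail probability has the closed form exp(-G/2), so no statistics library is needed.

```python
import math

ll_full = -111.28645     # log likelihood: low on age, lwt, i.race, ftv
ll_reduced = -111.62955  # log likelihood: low on lwt, i.race
df = 2                   # age and ftv are dropped in the reduced model

# Likelihood ratio statistic G = -2 * (ll_reduced - ll_full)
G = -2.0 * (ll_reduced - ll_full)

# Upper-tail probability of a chi-square variable with 2 df is exp(-x / 2)
p_value = math.exp(-G / 2.0)

print(f"G = {G:.4f} on {df} df, p-value = {p_value:.4f}")
```

Under these assumptions G is about 0.69 with a p-value of about 0.71, so the reduced model with lwt and race fits about as well as the full model.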
Assessing the Fit of the Model
1. Computation and evaluation of overall measures of fit
- Pearson Chi-Square
- Hosmer-Lemeshow Test
- Classification Table
- Area Under the Receiver Operating Characteristic (ROC) Curve
2. Examination of other measures (R^2)
3. Logistic Regression Diagnostics
4. Assessment of fit via external validation
. quietly xi: sw logit low age lwt i.race ftv, pr(.25) pe(.20) forward
. lfit

Logistic model for low, goodness-of-fit test
       number of observations =       189
 number of covariate patterns =       109
            Pearson chi2(105) =    111.22
                  Prob > chi2 =    0.3204
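-Computation and evaluation of overall measures of fit: Pearson Chi-Square

For individual observations:

$$\chi^2_{Pearson} = \sum_{i=1}^{n} \frac{(y_i - \hat{\pi}_i)^2}{\hat{\pi}_i (1 - \hat{\pi}_i)}$$

For covariate patterns, where $m_j$ is the number of subjects with $x = x_j$ and $y_j$ is the number of events among them:

$$\chi^2_{Pearson} = \sum_{j=1}^{J} \frac{(y_j - m_j \hat{\pi}_j)^2}{m_j \hat{\pi}_j (1 - \hat{\pi}_j)}$$

df = J - p - 1
J = number of covariate patterns; p = number of covariates in the model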
. do "G:\cat2011\pearson_chisquare.do"
. clear
. input id y x1
            id          y         x1
  1. 1 1 5
  2. 2 1 7
  3. 3 1 9
  4. 4 1 11
  5. 5 1 11
  6. 6 0 2
  7. 7 0 2
  8. 8 0 4
  9. 9 0 6
 10. 10 0 8
 11. end

. logit y x1, nolog

Logistic regression                               Number of obs   =         10
                                                  LR chi2(1)      =       5.37
                                                  Prob > chi2     =     0.0205
Log likelihood = -4.2462367                       Pseudo R2       =     0.3874
------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          x1 |   .6470744   .3861188     1.68   0.094    -.1097046    1.403853
       _cons |  -4.205984    2.65246    -1.59   0.113     -9.40471     .992742
------------------------------------------------------------------------------

. lfit

Logistic model for y, goodness-of-fit test
       number of observations =        10
 number of covariate patterns =         8
              Pearson chi2(6) =      7.34
                  Prob > chi2 =    0.2905

. predict phat
(option pr assumed; Pr(y))
. list y x1 phat

     +-------------------+
     | y   x1       phat |
     |-------------------|
  1. | 1    5   .2747586 |
  2. | 1    7   .5801861 |
  3. | 1    9   .8344758 |
  4. | 1   11   .9484284 |
  5. | 1   11   .9484284 |
  6. | 0    2   .0515716 |
  7. | 0    2   .0515716 |
  8. | 0    4   .1655242 |
  9. | 0    6   .4198139 |
 10. | 0    8   .7252414 |
     +-------------------+

. gen r2 = ((y-phat)/(sqrt(phat*(1-phat))))^2
. qui sum r2, de
. di "Pearson Chi-Square =" r(sum)
Pearson Chi-Square =7.3405042
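The same Pearson chi-square can be reproduced outside Stata. The Python sketch below is an illustration under stated assumptions: it plugs the rounded coefficients from the `logit y x1` output into the logistic function to obtain the fitted probabilities, then sums the squared Pearson residuals over the ten observations.

```python
import math

# Toy data from the do-file above
y = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
x1 = [5, 7, 9, 11, 11, 2, 2, 4, 6, 8]

# Coefficients as printed by Stata (rounded, so results match to about 4 decimals)
b0, b1 = -4.205984, 0.6470744

# Fitted probabilities: phat = 1 / (1 + exp(-(b0 + b1 * x)))
phat = [1.0 / (1.0 + math.exp(-(b0 + b1 * x))) for x in x1]

# Pearson chi-square: sum of (y - phat)^2 / (phat * (1 - phat))
pearson = sum((yi - pi) ** 2 / (pi * (1.0 - pi)) for yi, pi in zip(y, phat))

print(f"Pearson chi-square = {pearson:.4f}")
```

The observation-level sum agrees with the covariate-pattern statistic reported by `lfit` here because the two repeated patterns (x1 = 2 and x1 = 11) each contain identical outcomes.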
Computation and evaluation of overall measures of fit -Hosmer-Lemeshow Test:

$$\hat{C} = \sum_{k=1}^{g} \frac{(o_k - n'_k \bar{\pi}_k)^2}{n'_k \bar{\pi}_k (1 - \bar{\pi}_k)}$$

where
$n'_k$ = total number of subjects in the $k^{th}$ group
$c_k$ = number of covariate patterns in the $k^{th}$ decile
$o_k = \sum_{j=1}^{c_k} y_j$ = number of responses among the $c_k$ covariate patterns
$\bar{\pi}_k = \sum_{j=1}^{c_k} \frac{m_j \hat{\pi}_j}{n'_k}$ = average estimated probability

$H_0$: the model fits well
. lfit, group(3) table

Logistic model for y, goodness-of-fit test
(Table collapsed on quantiles of estimated probabilities)
  +--------------------------------------------------------+
  | Group |   Prob | Obs_1 | Exp_1 | Obs_0 | Exp_0 | Total |
  |-------+--------+-------+-------+-------+-------+-------|
  |     1 | 0.2748 |     1 |   0.5 |     3 |   3.5 |     4 |
  |     2 | 0.7252 |     1 |   1.7 |     2 |   1.3 |     3 |
  |     3 | 0.9484 |     3 |   2.7 |     0 |   0.3 |     3 |
  +--------------------------------------------------------+

       number of observations =        10
             number of groups =         3
      Hosmer-Lemeshow chi2(1) =      1.46
                  Prob > chi2 =    0.2275

. sort phat
. list phat y phat_gr

     +------------------------+
     |     phat   y   phat_gr |
     |------------------------|
  1. | .0515716   0         1 |
  2. | .0515716   0         1 |
  3. | .1655242   0         1 |
  4. | .2747586   1         1 |
  5. | .4198139   0         2 |
  6. | .5801861   1         2 |
  7. | .7252414   0         2 |
  8. | .8344758   1         3 |
  9. | .9484284   1         3 |
 10. | .9484284   1         3 |
     +------------------------+
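$$\hat{C} = \sum_{k=1}^{g} \frac{(o_k - n'_k \bar{\pi}_k)^2}{n'_k \bar{\pi}_k (1 - \bar{\pi}_k)}, \qquad \bar{\pi}_k = \sum_{j=1}^{c_k} \frac{m_j \hat{\pi}_j}{n'_k}$$

For group 1 ($n'_1 = 4$):

$$\bar{\pi}_1 = \frac{1}{4}\left(2 \times .0515716 + .1655242 + .2747586\right) = .1358565$$

$$\text{Exp}_1 = n'_1 \bar{\pi}_1 = 4 \times .1358565 = .543426$$

$$\text{Exp}_0 = n'_1 (1 - \bar{\pi}_1) = 4 - .543426 = 3.456574$$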
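The three-group statistic can be recomputed from the listing of phat, y, and phat_gr. The Python sketch below is a minimal illustration (the function name `hosmer_lemeshow` is chosen here, not taken from the slides); it applies the formula for C-hat group by group and, because the test has g - 2 = 1 degree of freedom, obtains the p-value from the complementary error function.

```python
import math

# phat, y and group membership as listed after `sort phat`
phat = [.0515716, .0515716, .1655242, .2747586, .4198139,
        .5801861, .7252414, .8344758, .9484284, .9484284]
y = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]
group = [1, 1, 1, 1, 2, 2, 2, 3, 3, 3]


def hosmer_lemeshow(y, phat, group):
    """C-hat = sum over groups of (o_k - n_k*pbar_k)^2 / (n_k*pbar_k*(1 - pbar_k))."""
    stat = 0.0
    for g in sorted(set(group)):
        idx = [i for i, gi in enumerate(group) if gi == g]
        n_k = len(idx)                           # subjects in the group
        o_k = sum(y[i] for i in idx)             # observed events
        pbar_k = sum(phat[i] for i in idx) / n_k # average estimated probability
        stat += (o_k - n_k * pbar_k) ** 2 / (n_k * pbar_k * (1.0 - pbar_k))
    return stat


c_hat = hosmer_lemeshow(y, phat, group)
df = len(set(group)) - 2
# For 1 df, P(chi2 > x) = erfc(sqrt(x / 2))
p_value = math.erfc(math.sqrt(c_hat / 2.0))

print(f"Hosmer-Lemeshow chi2({df}) = {c_hat:.2f}, p-value = {p_value:.4f}")
```

The result should agree with the `lfit, group(3) table` output to the printed precision.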
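Computation and evaluation of overall measures of fit -alternative form of the Hosmer-Lemeshow Test
H*: split the data into 10 roughly equal parts (by estimated probability)

$$H^* = \sum_{k=1}^{g} \frac{(o_k - e_k)^2}{e_k}, \qquad df = g - 2$$

Each probability group k contributes one term for outcome 1 and one term for outcome 0.
$e_k$ = expected number of events in each cell, defined by the outcome category (1, 0) and by the grouping of the estimated probability (phat)
$o_k$ = observed count in each cell, defined by the outcome category (1, 0) and by the grouping of the estimated probability (phat)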
Hosmer-Lemeshow Test / H0: the model fits
-H*: split the data into 10 roughly equal parts (by estimated probability)

. quietly xi: sw logit low age lwt i.race ftv, pr(.25) pe(.20) forward
. lfit, group(10) table

Logistic model for low, goodness-of-fit test
(Table collapsed on quantiles of estimated probabilities)

  _Group    _Prob   _Obs_1   _Exp_1   _Obs_0   _Exp_0   _Total
       1   0.1681        2      2.4       17     16.6       19
       2   0.2228        4      4.2       17     16.8       21
       3   0.2531        5      4.0       12     13.0       17
       4   0.2708        4      5.0       15     14.0       19
       5   0.2955        8      5.4       11     13.6       19
       6   0.3334        6      6.1       13     12.9       19
       7   0.3681        6      8.2       17     14.8       23
       8   0.4078        3      5.8       12      9.2       15
       9   0.4770       12      8.9        8     11.1       20
      10   0.5975        9      8.9        8      8.1       17

       number of observations =       189
             number of groups =        10
      Hosmer-Lemeshow chi2(8) =      7.61
                  Prob > chi2 =    0.4728

$$H^* = \sum_{k=1}^{g} \frac{(o_k - e_k)^2}{e_k}$$
Example of computing the Hosmer-Lemeshow test (model with lwt, race, and smoke only)

. use "H:\516701_2556\lowbwt_update.dta", clear
. xi: logit low lwt i.race smoke
i.race            _Irace_1-3          (naturally coded; _Irace_1 omitted)
...
Logistic regression                               Number of obs   =        189
                                                  LR chi2(4)      =      19.66
                                                  Prob > chi2     =     0.0006
Log likelihood = -107.50733                       Pseudo R2       =     0.0838
------------------------------------------------------------------------------
         low |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         lwt |  -.0132595   .0063102    -2.10   0.036    -.0256272   -.0008917
    _Irace_2 |   1.290094   .5108751     2.53   0.012     .2887976    2.291391
    _Irace_3 |   .9705149    .412235     2.35   0.019     .1625492    1.778481
       smoke |   1.060006    .378323     2.80   0.005     .3185065    1.801505
       _cons |  -.1092208   .8821091    -0.12   0.901    -1.838123    1.619681
------------------------------------------------------------------------------
. lfit, group(10) table

Logistic model for low, goodness-of-fit test
(Table collapsed on quantiles of estimated probabilities)
  +--------------------------------------------------------+
  | Group |   Prob | Obs_1 | Exp_1 | Obs_0 | Exp_0 | Total |
  |-------+--------+-------+-------+-------+-------+-------|
  |     1 | 0.1229 |     0 |   1.8 |    19 |  17.2 |    19 |
  |     2 | 0.1579 |     2 |   2.7 |    17 |  16.3 |    19 |
  |     3 | 0.2258 |     5 |   3.5 |    14 |  15.5 |    19 |
  |     4 | 0.2934 |     7 |   5.1 |    12 |  13.9 |    19 |
  |     5 | 0.3252 |     6 |   6.9 |    16 |  15.1 |    22 |
  |-------+--------+-------+-------+-------+-------+-------|
  |     6 | 0.3452 |     5 |   5.4 |    11 |  10.6 |    16 |
  |     7 | 0.3757 |    10 |   7.3 |    10 |  12.7 |    20 |
  |     8 | 0.4017 |     8 |   7.0 |    10 |  11.0 |    18 |
  |     9 | 0.4704 |     7 |   8.2 |    12 |  10.8 |    19 |
  |    10 | 0.7028 |     9 |  11.1 |     9 |   6.9 |    18 |
  +--------------------------------------------------------+

       number of observations =       189
             number of groups =        10
      Hosmer-Lemeshow chi2(8) =      7.35
                  Prob > chi2 =    0.4996
. predict phat
. sort phat
. xtile phat_gr = phat, nq(10)
. tab phat_gr low

        10 |
 quantiles |   Low Birth Weight
   of phat |         0          1 |     Total
-----------+----------------------+----------
         1 |        19          0 |        19
         2 |        17          2 |        19
         3 |        14          5 |        19
         4 |        12          7 |        19
         5 |        16          6 |        22
         6 |        11          5 |        16
         7 |        10         10 |        20
         8 |        10          8 |        18
         9 |        12          7 |        19
        10 |         9          9 |        18
-----------+----------------------+----------
     Total |       130         59 |       189
  +--------------------------------------------------------+
  | Group |   Prob | Obs_1 | Exp_1 | Obs_0 | Exp_0 | Total |
  |-------+--------+-------+-------+-------+-------+-------|
  |     1 | 0.1229 |     0 |   1.8 |    19 |  17.2 |    19 |
  |     2 | 0.1579 |     2 |   2.7 |    17 |  16.3 |    19 |
  . . .
  |     9 | 0.4704 |     7 |   8.2 |    12 |  10.8 |    19 |
  |    10 | 0.7028 |     9 |  11.1 |     9 |   6.9 |    18 |
  +--------------------------------------------------------+
. sort phat
. list phat low phat_gr if phat_gr==1
+---+
| phat low phat_gr |
|---|
1. | .0579963 0 1 |
2. | .0673255 0 1 |
3. | .0681629 0 1 |
4. | .0707333 0 1 |
5. | .0809414 0 1 |
|---|
6. | .0860122 0 1 |
7. | .0860122 0 1 |
8. | .0870603 0 1 |
9. | .0891913 0 1 |
10. | .0970244 0 1 |
|---|
11. | .0993727 0 1 |
12. | .1029206 0 1 |
13. | .1029899 0 1 |
14. | .1042213 0 1 |
15. | .1092779 0 1 |
|---|
16. | .1176729 0 1 |
17. | .1214464 0 1 |
18. | .1228683 0 1 |
19. | .1228683 0 1 |
+---+
. predict phat
. sort phat
. xtile phat_gr = phat, nq(10)
. tab phat_gr low
. qui su phat if phat_gr==1
. local q1=r(sum)
. local n1=r(N)
. di "Exp_1 11 = " `q1'
Exp_1 11 = 1.7940981
. di "Exp_0 10 = " `n1'-`q1'
Exp_0 10 = 17.205902
. . .
. qui su phat if phat_gr==10
. local q10=r(sum)
. local n10=r(N)
. di "Exp_1 101 = " `q10'
Exp_1 101 = 11.148085
. di "Exp_0 100 = " `n10'-`q10'
Exp_0 100 = 6.8519148

For group 1, Exp_1 is the sum of phat over the 19 subjects in the group, and Exp_0 = 19 - 1.7941 = 17.2059.
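The expected counts for group 1 can be checked directly from the listing of phat for `phat_gr==1`. This short Python sketch simply sums the 19 listed probabilities; the variable names are illustrative.

```python
# Estimated probabilities for the 19 subjects in the first decile group,
# copied from the `list phat low phat_gr if phat_gr==1` output
phat_group1 = [
    .0579963, .0673255, .0681629, .0707333, .0809414,
    .0860122, .0860122, .0870603, .0891913, .0970244,
    .0993727, .1029206, .1029899, .1042213, .1092779,
    .1176729, .1214464, .1228683, .1228683,
]

n1 = len(phat_group1)
exp_1 = sum(phat_group1)  # expected number with low = 1
exp_0 = n1 - exp_1        # expected number with low = 0

print(f"n = {n1}, Exp_1 = {exp_1:.7f}, Exp_0 = {exp_0:.6f}")
```

Rounded to one decimal these are the 1.8 and 17.2 shown in the first row of the `lfit, group(10) table` output.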
$$H^* = \sum_{k=1}^{g} \frac{(o_k - e_k)^2}{e_k} = \frac{(0 - 1.8)^2}{1.8} + \frac{(19 - 17.2)^2}{17.2} + \dots$$
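The full sum can be approximated from the printed table for the model with lwt, race, and smoke. The Python sketch below uses the expected counts as rounded to one decimal in the `lfit` output, so it reproduces Stata's 7.35 only approximately; that rounding is the main assumption. For an even number of degrees of freedom the chi-square upper-tail probability has a closed form, which the helper `chi2_sf_even_df` (an illustrative name) evaluates.

```python
import math

# (Obs_1, Exp_1, Obs_0, Exp_0) per group, from the lfit, group(10) table
table = [
    (0, 1.8, 19, 17.2),
    (2, 2.7, 17, 16.3),
    (5, 3.5, 14, 15.5),
    (7, 5.1, 12, 13.9),
    (6, 6.9, 16, 15.1),
    (5, 5.4, 11, 10.6),
    (10, 7.3, 10, 12.7),
    (8, 7.0, 10, 11.0),
    (7, 8.2, 12, 10.8),
    (9, 11.1, 9, 6.9),
]

# H* = sum over groups and both outcomes of (o - e)^2 / e
h_star = sum((o1 - e1) ** 2 / e1 + (o0 - e0) ** 2 / e0
             for o1, e1, o0, e0 in table)
df = len(table) - 2


def chi2_sf_even_df(x: float, df: int) -> float:
    """P(chi2_df > x) for even df: exp(-x/2) * sum_{k < df/2} (x/2)^k / k!."""
    half = x / 2.0
    term, total = 1.0, 1.0
    for k in range(1, df // 2):
        term *= half / k
        total += term
    return math.exp(-half) * total


p_value = chi2_sf_even_df(h_star, df)
print(f"H* = {h_star:.2f} on {df} df, p-value = {p_value:.3f}")
```

Because the inputs are rounded, the statistic comes out near 7.32 instead of 7.35; the conclusion (no evidence of lack of fit) is unchanged.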
Classification Tables

. quietly xi: sw logit low age lwt i.race ftv, pr(.25) pe(.20) forward
. lstat

Logistic model for low

              -------- True --------
Classified |         D            ~D  |      Total
-----------+--------------------------+-----------
     +     |         6             6  |         12
     -     |        53           124  |        177
-----------+--------------------------+-----------
   Total   |        59           130  |        189

Classified + if predicted Pr(D) >= .5
True D defined as low ~= 0
--------------------------------------------------
Sensitivity                     Pr( +| D)   10.17%
Specificity                     Pr( -|~D)   95.38%
Positive predictive value       Pr( D| +)   50.00%
Negative predictive value       Pr(~D| -)   70.06%
--------------------------------------------------
False + rate for true ~D        Pr( +|~D)    4.62%
False - rate for true D         Pr( -| D)   89.83%
False + rate for classified +   Pr(~D| +)   50.00%
False - rate for classified -   Pr( D| -)   29.94%
--------------------------------------------------
Correctly classified                        68.78%
--------------------------------------------------
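Each percentage in the classification output follows from the four cell counts. As a small worked check in Python (variable names are illustrative), using the 2 x 2 table printed by `lstat`:

```python
# Cell counts from the lstat table (cutoff Pr(D) >= .5)
tp, fp = 6, 6      # classified +: true D, true ~D
fn, tn = 53, 124   # classified -: true D, true ~D

sensitivity = tp / (tp + fn)   # Pr(+ | D)
specificity = tn / (tn + fp)   # Pr(- | ~D)
ppv = tp / (tp + fp)           # Pr(D | +)
npv = tn / (tn + fn)           # Pr(~D | -)
accuracy = (tp + tn) / (tp + fp + fn + tn)

for name, value in [("Sensitivity", sensitivity), ("Specificity", specificity),
                    ("Positive predictive value", ppv),
                    ("Negative predictive value", npv),
                    ("Correctly classified", accuracy)]:
    print(f"{name:<28}{100 * value:6.2f}%")
```

The low sensitivity shows that, at the .5 cutoff, the model classifies almost every subject as not low birth weight, so the 68.78% correct classification mostly reflects the 130 of 189 subjects who are true negatives or could be labeled negative by default.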
. lroc

Logistic model for low
number of observations =      189
area under ROC curve   =   0.6473

Area Under the Receiver Operating Characteristic (ROC) Curve
Rule of thumb for the area under the ROC curve:
ROC = 0.5           no discrimination, so we might as well flip a coin
0.5 < ROC < 0.7     poor discrimination, not much better than a coin toss
0.7 <= ROC < 0.8    acceptable discrimination
0.8 <= ROC < 0.9    excellent discrimination
ROC >= 0.9          outstanding discrimination
- In practice it is extremely unusual to observe areas under the ROC curve greater than 0.90.
- Complete separation would be required for the area under the ROC curve to exceed 0.90.
- When there is complete separation, it is impossible to estimate the coefficients of a logistic regression model.
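The area under the ROC curve equals the proportion of (event, non-event) pairs in which the event has the higher predicted probability, with ties counted as one half. The Python sketch below illustrates this on the ten-observation toy data (y, x1) used earlier for the Pearson chi-square; since phat increases with x1, x1 itself can serve as the score. The function names `auc` and `describe` are illustrative, and `describe` simply encodes the rule of thumb listed above.

```python
def auc(y, score):
    """Area under the ROC curve by pairwise comparison (ties count 1/2)."""
    pos = [s for yi, s in zip(y, score) if yi == 1]
    neg = [s for yi, s in zip(y, score) if yi == 0]
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def describe(area):
    """Rule-of-thumb label for an area under the ROC curve."""
    if area >= 0.9:
        return "outstanding discrimination"
    if area >= 0.8:
        return "excellent discrimination"
    if area >= 0.7:
        return "acceptable discrimination"
    if area > 0.5:
        return "poor discrimination"
    return "no discrimination"


y = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
x1 = [5, 7, 9, 11, 11, 2, 2, 4, 6, 8]

toy_area = auc(y, x1)
print(f"toy data: AUC = {toy_area:.2f} ({describe(toy_area)})")
print(f"low birth weight model: AUC = 0.6473 ({describe(0.6473)})")
```

Applied to the `lroc` result for the low birth weight model, the rule labels 0.6473 as poor discrimination.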
Other Summary Measures
-R^2 measures
-McFadden's pseudo R^2, Efron's pseudo R^2, etc.

Efron's pseudo R^2:

$$R^2_{ef} = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{\pi}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$$

McFadden's pseudo R^2:

$$R^2_{mf} = 1 - \frac{L_p}{L_0}$$

$L_0$ = log likelihood for the model containing only the intercept
$L_p$ = log likelihood for the model containing the intercept plus the p covariates

Nagelkerke's R^2 (Cragg & Uhler R^2):

$$R^2_{Nagelkerke} = \frac{1 - e^{-LR/n}}{1 - e^{2\,ll_0/n}}$$

$$ll_0 = n_0 \ln\left(\frac{n_0}{n}\right) + n_1 \ln\left(\frac{n_1}{n}\right)$$

$$ll_1 = \sum_{i=1}^{n} \left[ y_i \ln(\hat{p}_i) + (1 - y_i) \ln(1 - \hat{p}_i) \right]$$

$$LR = 2\,(ll_1 - ll_0)$$

$ll_0$ is the log likelihood of the model without regressors
$ll_1$ is the log likelihood of the full model
LR is the likelihood ratio test statistic
n is the sample size
$n_1$, $n_0$ are the numbers of observations whose response $y_i$ is 1 or 0
$\hat{p}_i$ is the probability predicted by the logit model

Hosmer & Lemeshow (2000, p. 167)
- Do not recommend routine publishing of R^2.
- However, R^2 may be helpful in the model building stage.
. xi: logit low lwt i.race
i.race            _Irace_1-3          (naturally coded; _Irace_1 omitted)
Iteration 0:   log likelihood =   -117.336
Iteration 1:   log likelihood =  -111.7491
Iteration 2:   log likelihood = -111.62983
Iteration 3:   log likelihood = -111.62955

Logistic regression                               Number of obs   =        189
                                                  LR chi2(3)      =      11.41
                                                  Prob > chi2     =     0.0097
Log likelihood = -111.62955                       Pseudo R2       =     0.0486
------------------------------------------------------------------------------
         low |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         lwt |  -.0152231   .0064393    -2.36   0.018    -.0278439   -.0026023
    _Irace_2 |   1.081066   .4880512     2.22   0.027     .1245034    2.037629
    _Irace_3 |   .4806033   .3566733     1.35   0.178    -.2184636     1.17967
       _cons |   .8057535   .8451625     0.95   0.340    -.8507345    2.462241
------------------------------------------------------------------------------

. di 1-((-111.62955)/(-117.336))
.04863341

McFadden's pseudo R^2: $R^2_{mf} = 1 - L_p / L_0$
. xi: sw logit low age lwt i.race ftv, pr(.25) pe(.20) forward
…
Logistic regression                               Number of obs   =        189
                                                  LR chi2(3)      =      11.41
                                                  Prob > chi2     =     0.0097
Log likelihood = -111.62955                       Pseudo R2       =     0.0486
------------------------------------------------------------------------------
         low |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         lwt |  -.0152231   .0064393    -2.36   0.018    -.0278439   -.0026023
    _Irace_2 |   1.081066   .4880512     2.22   0.027     .1245034    2.037629
    _Irace_3 |   .4806033   .3566733     1.35   0.178    -.2184636     1.17967
       _cons |   .8057535   .8451625     0.95   0.340    -.8507345    2.462241
------------------------------------------------------------------------------

. fitstat

Measures of Fit for logit of low

Log-Lik Intercept Only:       -117.336
Log-Lik Full Model:           -111.630
D(185):                        223.259
LR(3):                          11.413
Prob > LR:                       0.010
McFadden's R2:                   0.049
McFadden's Adj R2:               0.015
Maximum Likelihood R2:           0.059
Cragg & Uhler's R2:              0.082
McKelvey and Zavoina's R2:       0.092
Efron's R2:                      0.058
Variance of y*:                  3.622
Variance of error:               3.290
Count R2:                        0.688
Adj Count R2:                    0.000
AIC:                             1.224
AIC*n:                         231.259
BIC:                          -746.464
BIC':                            4.312
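Several of the `fitstat` values can be recomputed from quantities already shown in this section: n = 189, the 59 low birth weight cases, and the log likelihood of the fitted model. The Python sketch below is an illustration under stated assumptions; in particular, the adjusted McFadden formula 1 - (ll1 - K)/ll0 with K = 4 estimated parameters (including the intercept) is the definition assumed here, chosen because it reproduces the printed 0.015.

```python
import math

n, n1 = 189, 59          # sample size and number of low = 1 cases
n0 = n - n1
ll1 = -111.62955         # log likelihood of the model with lwt and i.race
k = 4                    # ASSUMPTION: parameters counted including the intercept

# Intercept-only log likelihood: ll0 = n1*ln(n1/n) + n0*ln(n0/n)
ll0 = n1 * math.log(n1 / n) + n0 * math.log(n0 / n)

lr = 2.0 * (ll1 - ll0)                        # likelihood ratio statistic
mcfadden = 1.0 - ll1 / ll0                    # McFadden's R2
mcfadden_adj = 1.0 - (ll1 - k) / ll0          # McFadden's adjusted R2
ml_r2 = 1.0 - math.exp(-lr / n)               # maximum likelihood R2
cragg_uhler = ml_r2 / (1.0 - math.exp(2.0 * ll0 / n))  # Nagelkerke / Cragg & Uhler

print(f"ll0 = {ll0:.3f}, LR = {lr:.3f}")
print(f"McFadden = {mcfadden:.3f}, adjusted = {mcfadden_adj:.3f}")
print(f"ML R2 = {ml_r2:.3f}, Cragg & Uhler = {cragg_uhler:.3f}")
```

The denominator in the Cragg & Uhler measure is the largest value the maximum likelihood R2 can reach for these data, which is why that measure is always the larger of the two.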
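Logistic Regression Diagnostics
Diagnostics examine the fit observation by observation, based on:
-residuals: $\Delta X^2_i$ and $\Delta D_i$
-influence: $\Delta\hat{\beta}_i$

Pearson residual:

$$r_i = \frac{y_i - m_i \hat{\pi}_i}{\sqrt{m_i \hat{\pi}_i (1 - \hat{\pi}_i)}}$$

Leverage (diagonal of the matrix H):

$$h_i = v_i \times b_i, \qquad v_i = m_i \hat{\pi}_i (1 - \hat{\pi}_i), \qquad b_i = x_i' (X'VX)^{-1} x_i$$

Change in the Pearson chi-square:

$$\Delta X^2_i = \frac{r_i^2}{1 - h_i}$$

Deviance residual:

$$d_i = -\sqrt{2 m_i \left| \ln(1 - \hat{\pi}_i) \right|}; \quad y_i = 0$$

$$d_i = \sqrt{2 m_i \left| \ln(\hat{\pi}_i) \right|}; \quad y_i = m_i$$

Change in the deviance:

$$\Delta D_i = d_i^2 + \frac{r_i^2 h_i}{1 - h_i} \approx \frac{d_i^2}{1 - h_i}$$

Influence on the estimated coefficients:

$$\Delta\hat{\beta}_i = (\hat{\beta} - \hat{\beta}_{(-i)})' (X'VX) (\hat{\beta} - \hat{\beta}_{(-i)}) = \frac{r_i^2 h_i}{(1 - h_i)^2}$$

Logistic Regression Diagnostics
- Plot $\Delta X^2_i$ versus $\hat{\pi}_i$
- Plot $\Delta D_i$ versus $\hat{\pi}_i$
- Plot $\Delta\hat{\beta}_i$ versus $\hat{\pi}_i$
Other plots
- Plot $\Delta X^2_i$ versus $h_i$
- Plot $\Delta D_i$ versus $h_i$
- Plot $\Delta\hat{\beta}_i$ versus $h_i$

$\Delta X^2_i$, $\Delta D_i$: compare with the upper 95th percentile of the chi-square distribution with 1 degree of freedom, 3.84; a crude approximation is 4.
$\Delta\hat{\beta}_i$: the influence diagnostic must be larger than 1 to indicate an influential observation.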
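The diagnostic quantities can be illustrated on the ten-observation toy data used earlier for the Pearson chi-square. The Python sketch below is a simplified illustration, not a reimplementation of Stata's `predict, dx2`, `dd`, or `db` options: it treats every observation as its own covariate pattern (m_i = 1), which is an assumption, and it uses the rounded coefficients printed by `logit y x1`. The leverage is obtained from the 2 x 2 matrix X'VX for the design columns (1, x1).

```python
import math

y = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
x1 = [5, 7, 9, 11, 11, 2, 2, 4, 6, 8]
b0, b1 = -4.205984, 0.6470744  # coefficients as printed by Stata

phat = [1.0 / (1.0 + math.exp(-(b0 + b1 * x))) for x in x1]
v = [p * (1.0 - p) for p in phat]  # v_i = m_i * phat_i * (1 - phat_i), with m_i = 1

# Elements of X'VX for the design columns (1, x1)
s0 = sum(v)
s1 = sum(vi * x for vi, x in zip(v, x1))
s2 = sum(vi * x * x for vi, x in zip(v, x1))
det = s0 * s2 - s1 * s1

# b_i = x_i' (X'VX)^{-1} x_i  and leverage h_i = v_i * b_i
h = [vi * (s2 - 2.0 * s1 * x + s0 * x * x) / det for vi, x in zip(v, x1)]

# Pearson and deviance residuals (m_i = 1, so y_i is either 0 or m_i)
r = [(yi - p) / math.sqrt(p * (1.0 - p)) for yi, p in zip(y, phat)]
d = [math.sqrt(2.0 * abs(math.log(p))) if yi == 1
     else -math.sqrt(2.0 * abs(math.log(1.0 - p)))
     for yi, p in zip(y, phat)]

dx2 = [ri ** 2 / (1.0 - hi) for ri, hi in zip(r, h)]                    # Delta X^2
dd = [di ** 2 + ri ** 2 * hi / (1.0 - hi) for di, ri, hi in zip(d, r, h)]  # Delta D
db = [ri ** 2 * hi / (1.0 - hi) ** 2 for ri, hi in zip(r, h)]           # Delta beta

for i in range(len(y)):
    flag = "*" if dx2[i] > 4 or dd[i] > 4 or db[i] > 1 else ""
    print(f"{i + 1:2d} h={h[i]:.3f} dX2={dx2[i]:6.3f} dD={dd[i]:6.3f} dB={db[i]:6.3f} {flag}")
```

A useful sanity check is that the leverages sum to the number of estimated parameters (2 here). Observations flagged with `*` exceed the crude cutoffs given above (4 for the residual-based measures, 1 for the influence measure).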
Logistic Regression Diagnostics (1) - Plot $\Delta X^2_i$ versus $\hat{\pi}_i$
. use "G:\hosmer_data\logistic\uis.dta", clear
. gen ndrgfp1 = ((ndrugtx+1)/10)^(-1)
. gen ndrgfp2 = ndrgfp1*log((ndrugtx+1)/10)
. gen agendrgfp1 = age*ndrgfp1
. gen racesite = race*site
. xi: logit dfree age ndrgfp1 ndrgfp2 i.ivhx race treat site agendrgfp1 racesite
. predict p
. predict dx, dx2
. graph twoway scatter dx p, xlabel(0(.2)1) ylabel(0(10)30)
xi: sw logit low age lwt i.race ftv, pr(.25) pe(.20) forward
predict p
predict dx, dx2
graph twoway scatter dx p, xlabel(0(.2)1) ylabel(0(10)30) ///
title(Figure I Plot Delta X^2 Versus Phat)

[Fig. I: Plot Delta X^2 Versus Phat. y-axis: H-L dX^2 (0 to 30); x-axis: Pr(low) (0 to 1)]
Logistic Regression Diagnostics (2) - Plot $\Delta D_i$ versus $\hat{\pi}_i$
. xi: logit dfree age ndrgfp1 ndrgfp2 i.ivhx race treat site agendrgfp1 racesite
. predict p
. predict dd, dd
. graph twoway scatter dd p, xlabel(0(.2)1) ylabel(0 3.5 7)

xi: sw logit low age lwt i.race ftv, pr(.25) pe(.20) forward
predict p
predict dd, dd
graph twoway scatter dd p, xlabel(0(.2)1) ylabel(0 3.5 7) ///
title(Fig. II Plot Delta Di Versus Phat)

[Fig. II: Plot Delta Di Versus Phat. y-axis: H-L dD (0 to 7); x-axis: Pr(low) (0 to 1)]
Logistic Regression Diagnostics (3) - Plot $\Delta\hat{\beta}_i$ versus $\hat{\pi}_i$
. xi: logit dfree age ndrgfp1 ndrgfp2 i.ivhx race treat site agendrgfp1 racesite
. predict p
. predict db, db
. graph twoway scatter db p, xlabel(0(.2)1) ylabel(0.15 .3)

xi: sw logit low age lwt i.race ftv, pr(.25) pe(.20) forward
predict p
predict db, db
graph twoway scatter db p, xlabel(0(.2)1) ylabel(0.15 .3) ///
title(Fig. III Plot Delta Beta Versus Phat)

[Fig. III: Plot Delta Beta Versus Phat. y-axis: Pregibon's dbeta (ticks at .15 and .3); x-axis: Pr(low) (0 to 1)]