

3.17. Multiple linear regression with Stata

The 3.12.1.Framingham.log file continues as follows and illustrates how to perform the analyses discussed in Sections 3.14, 3.15, and 3.16.

. *

. * Use multiple regression models to analyze the effects of log(bmi),

. * age and log(scl) on log(sbp)

. *

. generate woman = sex - 1

. generate wo_lbmi = woman * logbmi

(9 missing values generated)

. generate wo_lscl = woman * logscl

(33 missing values generated)

. regress logsbp logbmi age logscl woman wo_lbmi wo_age wo_lscl          1

Output omitted. See Table 3.2

. regress logsbp logbmi age logscl woman wo_lbmi wo_age                  2

Output omitted. See Table 3.2

. regress logsbp logbmi age logscl woman wo_age                          3

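      Source |       SS       df       MS              Number of obs =    4658
-------------+------------------------------           F(  5,  4652) =  318.33     4
       Model |  30.8663845     5   6.1732769           Prob > F      =  0.0000     5
    Residual |  90.2160593  4652  .019392962           R-squared     =  0.2549     6
-------------+------------------------------           Adj R-squared =  0.2541
       Total |  121.082444  4657  .026000095           Root MSE      =  .13926

------------------------------------------------------------------------------
      logsbp |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]     7
-------------+----------------------------------------------------------------
      logbmi |    .262647   .0137549    19.09   0.000     .2356808    .2896131
         age |   .0035167   .0003644     9.65   0.000     .0028023    .0042311     8
      logscl |   .0595923   .0114423     5.21   0.000     .0371599    .0820247
       woman |  -.2165261   .0233469    -9.27   0.000    -.2622971   -.1707551
      wo_age |   .0048624   .0004988     9.75   0.000     .0038846    .0058403
       _cons |   3.537356   .0740649    47.76   0.000     3.392153    3.682558     9
------------------------------------------------------------------------------

. *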

. * Calculate 95% confidence and prediction intervals for a 60

. * year-old woman with a SCL of 400 and a BMI of 40.

. *

. edit                                                                   10

- preserve

- set obs 4700

- replace scl = 400 in 4700

- replace age = 60 in 4700

- replace bmi = 40 in 4700

- replace woman = 1 in 4700

- replace id = 9999 in 4700

-

. replace logbmi = log(bmi) if id == 9999                                11

(1 real change made)

. replace logscl = log(scl) if id == 9999

(1 real change made)

. replace wo_age = woman*age if id == 9999

(1 real change made)

. predict yhat,xb 12

(41 missing values generated)

. label variable yhat "Expected log[SBP]"

. predict h, leverage 13

(41 missing values generated)

. predict std_yhat, stdp                                                 14

(41 missing values generated)

. predict std_f, stdf                                                    15

(41 missing values generated)

. generate cil_yhat = yhat - invttail(4658-5-1,.025)*std_yhat            16

(41 missing values generated)

. generate ciu_yhat = yhat + invttail(4658-5-1,.025)*std_yhat

(41 missing values generated)

. generate cil_f = yhat - invttail(4658-5-1,.025)*std_f                  17

(41 missing values generated)

. generate ciu_f = yhat + invttail(4658-5-1,.025)*std_f

(41 missing values generated)

. generate cil_sbpf = exp(cil_f)                                         18

. generate ciu_sbpf = exp(ciu_f)

(41 missing values generated)

. list bmi age scl woman logbmi logscl yhat h std_yhat std_f             19

> cil_yhat ciu_yhat cil_f ciu_f cil_sbpf ciu_sbpf if id==9999

4700.    bmi    age    scl    woman     logbmi     logscl       yhat          h
          40     60    400        1   3.688879   5.991465   5.149496    .003901

         std_yhat      std_f   cil_yhat   ciu_yhat      cil_f      ciu_f
         .0086978     .13953   5.132444   5.166547   4.875951    5.42304

         cil_sbpf   ciu_sbpf
         131.0987   226.5669

. display invttail(4652,.025)

1.9604741

Comments

1 This command regresses logsbp against the other covariates given in the command line. It evaluates Model (3.23). The equivalent point-and-click command is Statistics > Linear models and related > Linear regression > Model: Dependent variable: logsbp, Independent variables: logbmi age logscl woman wo_lbmi wo_age wo_lscl > Submit.

Stata has a powerful syntax for building models with interaction that I will introduce in Section 5.23. For didactic reasons, I have chosen to introduce these models by first calculating the interaction covariates explicitly. However, the Stata syntax for these models, which I will introduce for logistic regression, also works for linear regression models and can appreciably reduce your programming effort.
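As a preview only, the following sketch shows how Model (3.23) might be specified with factor-variable notation instead of explicitly generated interaction covariates. It assumes Stata 11 or later and that logsbp, logbmi, age, logscl, and woman are in memory as defined above; it is not the form used in this log.

```stata
* Sketch under the assumptions stated above (Stata 11+, data in memory).
* c. marks a continuous covariate, i. marks an indicator (factor) variable,
* and # forms the product of the two, so the wo_* variables are not needed.
regress logsbp c.logbmi c.age c.logscl i.woman ///
    i.woman#c.logbmi i.woman#c.age i.woman#c.logscl
```

The interaction coefficients reported for the woman#c. terms should correspond to those of wo_lbmi, wo_age, and wo_lscl in the explicit version.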

2 This command evaluates Model (3.24).

3 This command evaluates Model (3.25).

4 The output from the regress command for multiple linear regression is similar to that for simple linear regression that was discussed in Section 2.12. The mean sum of squares due to the model (MSM) and the mean squared error (MSE) are 6.173 and 0.01939, respectively. The F statistic for testing whether all of the β parameters are simultaneously zero is F = MSM/MSE = 6.1732769/0.019392962 = 318.33.

5 This F statistic is of overwhelming statistical significance, indicating that the model covariates do affect the value of log SBP.
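The arithmetic behind the F statistic can be checked directly in Stata. This is a minimal sketch: the first two lines use only numbers copied from the output above, while the last two assume that the regression of comment 3 is still the active estimation result.

```stata
* F statistic as the ratio MSM / MSE, using values from the ANOVA table
display 6.1732769 / .019392962
* Upper-tail probability of an F with 5 and 4652 degrees of freedom
display Ftail(5, 4652, 318.33)
* Same quantities from the results stored by regress (model must be in memory)
display (e(mss)/e(df_m)) / (e(rss)/e(df_r))
display e(F)
```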

6 The $R^2$ statistic = MSS/TSS = 30.866/121.08 = 0.2549. Recall that the MSE equals $s^2$ and is defined by Equation (3.5). Taking the square root of this variance estimate gives the root MSE s = 0.13926.
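These two summary statistics can be reproduced from the ANOVA table in the same way. The sketch below uses values copied from the output; the e(r2) and e(rmse) lines assume the regression is still the active estimation result.

```stata
* R-squared as the model sum of squares over the total sum of squares
display 30.8663845 / 121.082444
* Root MSE as the square root of the mean squared error
display sqrt(.019392962)
* Stored results from regress (model must be in memory)
display e(r2)
display e(rmse)
```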

7 For each covariate in the model, this table gives the estimate of the associated regression coefficient, the standard error of this estimate, the t statistic for testing the null hypothesis that the true value of the parameter equals zero, the P-value that corresponds to this t statistic, and the 95% confidence interval for the coefficient estimate. The coefficient estimates in the second column of this table are also given in Table 3.2 in the second column on the right.

8 Note that although the age parameter estimate is small, it is almost ten times larger than its associated standard error. Hence this estimate differs from zero with high statistical significance. The large range of the age of study subjects means that the influence of age on logsbp will be appreciable even though this coefficient is small.
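To make the size of the age effect concrete, the following sketch exponentiates the age slope over a 30-year age difference. The 30-year gap is an illustrative choice, not a value taken from the text; the coefficients are copied from the output above, with the woman-by-age interaction added for women.

```stata
* Multiplicative change in SBP associated with 30 additional years of age,
* holding the other covariates fixed (30 years is an illustrative assumption)
* Men: slope is the age coefficient alone
display exp(30 * .0035167)
* Women: slope is the age coefficient plus the wo_age coefficient
display exp(30 * (.0035167 + .0048624))
```

Under this model the two ratios come to roughly 1.11 for men and 1.29 for women, that is, increases of about 11% and 29% on the SBP scale.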

9 The estimate of the constant coefficient α is 3.537356.

10 Typing edit opens the Stata Editor window (there is a button on the toolbar that does this as well). This command is similar to the browse command in that it shows the data in memory. However, unlike the browse command, the edit command permits you to modify or enter data in memory. We use this editor here to create a new record with covariates scl, age, bmi, and woman equal to 400, 60, 40, and 1, respectively. For subsequent manipulation, set id equal to 9999 (or any other identification number that has not already been assigned).
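The same record could also be created without the editor, for example in a do-file. The fragment below is a sketch that assumes the Framingham data are in memory and that no existing record has id equal to 9999; the local macro new is a hypothetical name used to avoid hard-coding the observation number.

```stata
* Append one empty observation and fill in the new patient's covariates.
* Assumes the data are in memory and that id 9999 is not already in use.
local new = _N + 1
set obs `new'
replace scl   = 400  in `new'
replace age   = 60   in `new'
replace bmi   = 40   in `new'
replace woman = 1    in `new'
replace id    = 9999 in `new'
```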

11 The replace command redefines those values of an existing variable for which the if command qualifier is true. In this command, logbmi is only calculated for the new patient with id = 9999. This and the following two statements define the covariates logbmi, logscl, and wo_age for this patient. The equivalent point-and-click command is Data > Create or change variables > Change contents of variable > Main: Variable: logbmi, New contents: log(bmi) > if/in: Restrict to observations, If: (expression) id == 9999 > Submit.

12 The variable yhat is set equal to $\hat{y}_i$ for each record in memory. That is, yhat equals the estimated expected value of logsbp for each patient. This includes the new record that we have just created. Note that the regression parameter estimates are unaffected by this new record since it was created after the regress command was given.

13 The leverage option of this predict command creates a new variable called h that equals the leverage for each patient. Note that h is defined for our new patient even though no value of logsbp is given. This is because the leverage is a function of the covariates and does not involve the response variable. The equivalent point-and-click command is Statistics > Postestimation > Predictions, residuals, etc. > Main: New variable name: h, Produce: Leverage > Submit. Other predict options are defined on the predict dialog box in a similar way.

14 The stdp option sets std_yhat equal to the standard error of yhat, which equals $s\sqrt{h_i}$.

15 The stdf option sets std_f equal to the standard deviation of logsbp given the patient's covariates. That is, std_f = $s\sqrt{h_i + 1}$.
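The relationship between the root MSE s, the leverage, and these two standard errors can be checked numerically. In this sketch the first two lines use only values from the output for the new patient (s = .13926 and h = .003901); the remaining lines assume the regression is still the active estimation result and that h, std_yhat, and std_f exist as created above. The names chk_stdp and chk_stdf are hypothetical.

```stata
* By hand for the new patient: s*sqrt(h) and s*sqrt(h + 1)
display .13926 * sqrt(.003901)
display .13926 * sqrt(.003901 + 1)
* For every record, using the stored root MSE (model must be in memory)
generate chk_stdp = e(rmse) * sqrt(h)
generate chk_stdf = e(rmse) * sqrt(h + 1)
list std_yhat chk_stdp std_f chk_stdf if id == 9999
```

The hand calculations should agree, up to rounding, with the listed values .0086978 and .13953.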

16 This command and the next define cil_yhat and ciu_yhat to be the lower and upper bounds of the 95% confidence interval for yhat, respectively. This interval is given by Equation (3.12). Note that there are 4658 patients in our regression and there are 5 covariates in our model. Hence the number of degrees of freedom equals 4658 − 5 − 1 = 4652.
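For the new patient, the confidence bounds can be reproduced from the listed values of yhat and the standard error of yhat. This is a sketch using numbers copied from the output; the final line assumes the regression is still in memory and shows that the residual degrees of freedom need not be typed by hand.

```stata
* Critical value of the t distribution with 4652 degrees of freedom
display invttail(4652, .025)
* Lower and upper 95% confidence bounds for the expected log SBP
display 5.149496 - invttail(4652, .025) * .0086978
display 5.149496 + invttail(4652, .025) * .0086978
* The residual degrees of freedom are also stored by regress as e(df_r)
display invttail(e(df_r), .025)
```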

17 This command and the next define cil_f and ciu_f to be the lower and upper bounds of the 95% prediction interval for logsbp given the patient's covariates. This interval is given by Equation (3.14).

18 This command and the next define the 95% prediction interval for the SBP of a new patient having the specified covariates. We exponentiate the prediction interval given by Equation (3.14) to obtain the interval for SBP as opposed to log[SBP].
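The prediction interval and its back-transformation can be checked in the same way. The sketch below uses only values listed for the new patient: yhat = 5.149496 and a forecast standard deviation of .13953.

```stata
* 95% prediction bounds on the log scale
display 5.149496 - invttail(4652, .025) * .13953
display 5.149496 + invttail(4652, .025) * .13953
* Exponentiating the listed log-scale bounds gives the interval for SBP itself
display exp(4.875951)
display exp(5.42304)
```

Up to rounding, the last two lines should reproduce the bounds 131.0987 and 226.5669 shown in the listing.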

19 This command lists the covariates and calculated values for the new patient only (that is, for observations for which id = 9999 is true). The highlighted values in the output were also calculated by hand in Section 3.16.