Logistic Regression Models

4.2 Multiple Risk Factor Models

So far we have considered models for analyzing data that record the presence or absence of a response in an individual to a single type of stimulus. In many cases, however, multiple factors interact to induce risk, so we need to model their interrelationship and predict the risk probability on that basis. Typical examples include cases in which we wish to (1) construct a model for analyzing how blood pressure, cholesterol, and other factors jointly relate to the probability of occurrence of a certain disease; (2) predict the risk in extending a loan or issuing a credit card based on age, income, family structure, years of employment, and other factors; and (3) build a model that predicts corporate risk on the basis of total assets, sales, operating profit, equity ratio, and other factors.

Let the observed data for the $p$ predictor variables (risk factors) $x_1, x_2, \cdots, x_p$ and the response variable $y$ be written

$$(x_{i1}, x_{i2}, \cdots, x_{ip}),\ y_i;\quad i = 1, 2, \cdots, n, \qquad y_i = \begin{cases} 1 & \text{response} \\ 0 & \text{non-response,} \end{cases} \tag{4.9}$$

where $x_{ij}$ ($i = 1, 2, \cdots, n$, $j = 1, 2, \cdots, p$) is the data for the $j$-th predictor variable of the $i$-th individual. If a response occurs when the $i$-th individual is exposed to a stimulus comprising $p$ risk factors, we take the data as $y_i = 1$, and if there is no response, we take it as $y_i = 0$.

As in the previous section, if we use the random variable $Y$ representing response or non-response, then the respective response and non-response probabilities for the multiple risk factors $x_1, \cdots, x_p$ may be expressed as

$$P(Y = 1 \mid x_1, \cdots, x_p) = \pi, \qquad P(Y = 0 \mid x_1, \cdots, x_p) = 1 - \pi. \tag{4.10}$$

The multiple risk factors and the response probabilities are linked, moreover, by

$$\pi = \frac{\exp(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p)}{1 + \exp(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p)} = \frac{\exp(\boldsymbol{\beta}^T \boldsymbol{x})}{1 + \exp(\boldsymbol{\beta}^T \boldsymbol{x})}, \tag{4.11}$$

where $\boldsymbol{\beta} = (\beta_0, \beta_1, \cdots, \beta_p)^T$ and $\boldsymbol{x} = (1, x_1, x_2, \cdots, x_p)^T$. This is referred to as a multiple logistic regression model. By the logit transformation, in the same manner as in (4.4), this becomes

$$\log \frac{\pi}{1 - \pi} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p = \boldsymbol{\beta}^T \boldsymbol{x}, \tag{4.12}$$

which expresses the linear combination of the predictor variables representing the multiple risk factors. In the next section, we discuss the estimation of model parameters by maximum likelihood with the binary response expressed in the form of the probability distribution.
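As a small illustration of (4.11) and (4.12), the sketch below evaluates the response probability for one individual and checks that the logit transformation recovers the linear combination of the risk factors. The coefficient and risk-factor values are hypothetical, chosen only for the demonstration, and the function names `logistic_prob` and `logit` are not from the text.

```python
import numpy as np

def logistic_prob(beta, x):
    """Response probability pi of (4.11).

    beta = (beta_0, beta_1, ..., beta_p), x = (x_1, ..., x_p).
    1 / (1 + exp(-eta)) is algebraically equal to exp(eta) / (1 + exp(eta)).
    """
    eta = beta[0] + np.dot(beta[1:], x)
    return 1.0 / (1.0 + np.exp(-eta))

def logit(pi):
    """Logit transformation of (4.12): log of the odds pi / (1 - pi)."""
    return np.log(pi / (1.0 - pi))

# Hypothetical values, for illustration only.
beta = np.array([-1.0, 0.8, -0.5])   # intercept and two coefficients
x = np.array([2.0, 1.0])             # two risk factors

pi = logistic_prob(beta, x)
eta = beta[0] + np.dot(beta[1:], x)  # linear predictor: -1.0 + 1.6 - 0.5

print(pi)                 # a probability strictly between 0 and 1
print(logit(pi), eta)     # the two values agree up to rounding error
```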

4.2.1 Model Estimation

We consider the use of maximum likelihood for parameter estimation in the logistic regression model. The first question is how to represent, in a probability distribution model, a phenomenon whose outcomes are given in binary form as response or non-response. Let us begin by considering the toss of a coin that has probability $\pi$ of coming up heads, a value that could be determined exactly only by an infinite number of trials. We represent the outcome of each trial by the random variable $Y$, which is assigned a value of 1 if the coin comes up heads and 0 if it comes up tails. We may express this as $P(Y = 1) = \pi$, $P(Y = 0) = 1 - \pi$. The random variable $Y$ thus takes a value of either 0 or 1 and is therefore a discrete random variable with probability distribution

$$f(y \mid \pi) = \pi^{y} (1 - \pi)^{1 - y}, \qquad y = 0, 1, \tag{4.13}$$

which is the Bernoulli distribution. We consider modeling based on the data in (4.9) using this distribution.

The observed data for the $i$-th individual consist of the value $y_i$ taken by the binary random variable $Y_i$, representing response or non-response, for the multiple risk factors $(x_{i1}, x_{i2}, \cdots, x_{ip})$. If we take $\pi_i$ as the true response rate for the $i$-th set of multiple risk factors, the probability distribution of $Y_i$ is then given by

$$f(y_i \mid \pi_i) = \pi_i^{y_i} (1 - \pi_i)^{1 - y_i}, \qquad y_i = 0, 1, \quad i = 1, 2, \cdots, n. \tag{4.14}$$

The likelihood function based on $y_1, y_2, \cdots, y_n$ is accordingly

$$L(\pi_1, \pi_2, \cdots, \pi_n) = \prod_{i=1}^{n} f(y_i \mid \pi_i) = \prod_{i=1}^{n} \pi_i^{y_i} (1 - \pi_i)^{1 - y_i}. \tag{4.15}$$

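By substituting the logistic regression model (4.11), which links the multiple risk factors to the response rate $\pi_i$ of the $i$-th individual, into (4.15), we have the likelihood function of the $(p + 1)$-dimensional parameter vector $\boldsymbol{\beta}$

$$L(\boldsymbol{\beta}) = \prod_{i=1}^{n} \frac{\exp(y_i \boldsymbol{\beta}^T \boldsymbol{x}_i)}{1 + \exp(\boldsymbol{\beta}^T \boldsymbol{x}_i)}, \tag{4.16}$$

where $\boldsymbol{x}_i = (1, x_{i1}, x_{i2}, \cdots, x_{ip})^T$. Accordingly, the log-likelihood function for the parameter vector $\boldsymbol{\beta}$ of the logistic regression model is

$$\ell(\boldsymbol{\beta}) = \log L(\boldsymbol{\beta}) = \sum_{i=1}^{n} y_i \boldsymbol{\beta}^T \boldsymbol{x}_i - \sum_{i=1}^{n} \log\{1 + \exp(\boldsymbol{\beta}^T \boldsymbol{x}_i)\}. \tag{4.17}$$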

The optimization process with respect to the unknown parameter vector $\boldsymbol{\beta}$ is nonlinear, and the likelihood equation does not have an explicit solution. The maximum likelihood estimate $\hat{\boldsymbol{\beta}}$ may in this case be obtained using a numerical optimization technique, such as the Newton-Raphson method.
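The objective that such a numerical method maximizes can be written down directly. The following sketch evaluates the log-likelihood of the logistic regression model in two ways, once as the sum of $y_i \boldsymbol{\beta}^T \boldsymbol{x}_i - \log\{1 + \exp(\boldsymbol{\beta}^T \boldsymbol{x}_i)\}$ and once as the sum of the Bernoulli log-probabilities from (4.14), and confirms that they agree. The tiny data set and the trial value of $\boldsymbol{\beta}$ are made up for illustration; the function names are assumptions of this sketch.

```python
import numpy as np

def log_likelihood(beta, X, y):
    """Sum over i of y_i * beta^T x_i - log(1 + exp(beta^T x_i)).

    X is the n x (p+1) design matrix whose first column is all ones.
    np.logaddexp(0, eta) computes log(1 + exp(eta)) in a numerically stable way.
    """
    eta = X @ beta
    return float(np.sum(y * eta - np.logaddexp(0.0, eta)))

def log_likelihood_bernoulli(beta, X, y):
    """Same quantity, written as the log of the Bernoulli likelihood (4.15)."""
    pi = 1.0 / (1.0 + np.exp(-(X @ beta)))
    return float(np.sum(y * np.log(pi) + (1.0 - y) * np.log(1.0 - pi)))

# Made-up data: four individuals, one risk factor plus an intercept column.
X = np.array([[1.0, 0.5],
              [1.0, -1.2],
              [1.0, 2.0],
              [1.0, 0.1]])
y = np.array([1.0, 0.0, 1.0, 0.0])
beta = np.array([0.2, -0.7])          # arbitrary trial value, not an estimate

ll_a = log_likelihood(beta, X, y)
ll_b = log_likelihood_bernoulli(beta, X, y)
print(ll_a, ll_b)                     # the two forms agree up to rounding error
```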

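The first and second derivatives of the log-likelihood function $\ell(\boldsymbol{\beta})$ with respect to $\boldsymbol{\beta}$ are given by

$$\frac{\partial \ell(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}} = \sum_{i=1}^{n} (y_i - \pi_i) \boldsymbol{x}_i = X^T \Lambda \boldsymbol{1}_n, \tag{4.18}$$

$$\frac{\partial^2 \ell(\boldsymbol{\beta})}{\partial \boldsymbol{\beta} \partial \boldsymbol{\beta}^T} = -\sum_{i=1}^{n} \pi_i (1 - \pi_i) \boldsymbol{x}_i \boldsymbol{x}_i^T = -X^T \Pi (I_n - \Pi) X, \tag{4.19}$$

where $X = (\boldsymbol{x}_1, \boldsymbol{x}_2, \cdots, \boldsymbol{x}_n)^T$ is an $n \times (p + 1)$ design matrix, $\boldsymbol{1}_n$ is an $n$-dimensional vector, the elements of which are all 1, and $\Lambda$ and $\Pi$ are $n \times n$ diagonal matrices defined as

$$\Lambda = \mathrm{diag}(y_1 - \pi_1, y_2 - \pi_2, \cdots, y_n - \pi_n), \qquad \Pi = \mathrm{diag}(\pi_1, \pi_2, \cdots, \pi_n).$$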

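Starting from an initial value, we numerically obtain a solution using the following update formula:

$$\boldsymbol{\beta}^{\mathrm{new}} = \boldsymbol{\beta}^{\mathrm{old}} + E\left[ -\frac{\partial^2 \ell(\boldsymbol{\beta}^{\mathrm{old}})}{\partial \boldsymbol{\beta} \partial \boldsymbol{\beta}^T} \right]^{-1} \frac{\partial \ell(\boldsymbol{\beta}^{\mathrm{old}})}{\partial \boldsymbol{\beta}}. \tag{4.20}$$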

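This update formula is referred to as Fisher's scoring algorithm (Nelder and Wedderburn, 1972; Green and Silverman, 1994), and the $(r + 1)$st estimator, $\hat{\boldsymbol{\beta}}^{(r+1)}$, is updated by

$$\hat{\boldsymbol{\beta}}^{(r+1)} = \left\{ X^T \Pi^{(r)} (I_n - \Pi^{(r)}) X \right\}^{-1} X^T \Pi^{(r)} (I_n - \Pi^{(r)}) \boldsymbol{\xi}^{(r)},$$

where $\boldsymbol{\xi}^{(r)} = X \hat{\boldsymbol{\beta}}^{(r)} + \{\Pi^{(r)} (I_n - \Pi^{(r)})\}^{-1} (\boldsymbol{y} - \Pi^{(r)} \boldsymbol{1}_n)$ and $\Pi^{(r)}$ is an $n \times n$ diagonal matrix having $\pi_i = \exp(\hat{\boldsymbol{\beta}}^{(r)T} \boldsymbol{x}_i) / \{1 + \exp(\hat{\boldsymbol{\beta}}^{(r)T} \boldsymbol{x}_i)\}$ for the $r$th estimator $\hat{\boldsymbol{\beta}}^{(r)}$ in the $i$-th diagonal element. Thus, by substituting the estimator $\hat{\boldsymbol{\beta}}$ determined by the numerical optimization procedure into (4.11), we have the estimated logistic regression model

$$\hat{\pi}(\boldsymbol{x}) = \frac{\exp(\hat{\boldsymbol{\beta}}^T \boldsymbol{x})}{1 + \exp(\hat{\boldsymbol{\beta}}^T \boldsymbol{x})}, \tag{4.21}$$

which is used to predict risk probability for the multiple risk factors.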
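The Fisher scoring update can be sketched in a few lines of NumPy. This is a minimal illustration rather than a production routine: it assumes a design matrix with a leading column of ones, starts from the zero vector, uses a simple stopping rule, and has no safeguards for separable data or fitted probabilities that reach 0 or 1. The data are simulated, and the function name `fisher_scoring`, the "true" coefficients, and the tolerance are all choices made for this sketch.

```python
import numpy as np

def fisher_scoring(X, y, max_iter=50, tol=1e-10):
    """Iterate beta <- {X^T W X}^{-1} X^T W xi with W = Pi (I_n - Pi)."""
    beta = np.zeros(X.shape[1])               # initial value
    for _ in range(max_iter):
        eta = X @ beta
        pi = 1.0 / (1.0 + np.exp(-eta))       # diagonal elements of Pi
        w = pi * (1.0 - pi)                   # diagonal elements of Pi (I_n - Pi)
        xi = eta + (y - pi) / w               # working response xi
        XtW = X.T * w                         # X^T W, scaling column i of X^T by w_i
        beta_new = np.linalg.solve(XtW @ X, XtW @ xi)
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta

# Simulated data under assumed coefficients (illustration only).
rng = np.random.default_rng(0)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta_true = np.array([-0.5, 1.0, -1.5])
prob = 1.0 / (1.0 + np.exp(-(X @ beta_true)))
y = (rng.uniform(size=n) < prob).astype(float)

beta_hat = fisher_scoring(X, y)

# At the maximum likelihood estimate the gradient X^T (y - pi) vanishes.
pi_hat = 1.0 / (1.0 + np.exp(-(X @ beta_hat)))
score = X.T @ (y - pi_hat)
print(beta_hat)
print(np.max(np.abs(score)))
```

Solving the linear system with `np.linalg.solve` instead of forming the inverse explicitly is a common numerical choice; it computes the same update.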

Example 4.1 (Probability of the presence of calcium oxalate crystals) The presence of calcium oxalate crystals in the body leads to the formation of kidney or ureteral stones, but detection of their presence requires a thorough medical examination. We consider the construction of a simple method of risk diagnosis with a model comprising six properties of urine that are thought to be factors in crystal formation.

The six-dimensional data that are available (Andrews and Herzberg, 1985, p. 249) are the results of thorough examination in which calcium oxalate crystals were found to be present in the urine of 33 of the 77 individuals and not present in that of 44. The six urine properties that were measured for each individual were specific gravity $x_1$, pH $x_2$, osmolality (mOsm) $x_3$, conductivity $x_4$, urea concentration $x_5$, and calcium concentration (CALC) $x_6$. The label variable $Y$ is represented as $y_i = 1$; $i = 1, 2, \cdots, 33$ for the individuals in which the crystals were present and as $y_i = 0$; $i = 1, 2, \cdots, 44$ for those in which they were absent.

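Estimating the logistic regression model in (4.11) by the maximum likelihood method yields

$$\hat{\pi}(\boldsymbol{x}) = \frac{\exp(\hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \hat{\beta}_3 x_3 + \hat{\beta}_4 x_4 + \hat{\beta}_5 x_5 + \hat{\beta}_6 x_6)}{1 + \exp(\hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \hat{\beta}_3 x_3 + \hat{\beta}_4 x_4 + \hat{\beta}_5 x_5 + \hat{\beta}_6 x_6)},$$

where

$$\begin{aligned} &\hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \hat{\beta}_3 x_3 + \hat{\beta}_4 x_4 + \hat{\beta}_5 x_5 + \hat{\beta}_6 x_6 \\ &\quad = -355.34 + 355.94 x_1 - 0.5 x_2 + 0.02 x_3 - 0.43 x_4 + 0.03 x_5 + 0.78 x_6. \end{aligned} \tag{4.22}$$

If this risk prediction model is applied, for example, to the data $\boldsymbol{x} = (1.017, 5.74, 577, 20, 296, 4.49)$, then the logistic regression model (4.22) yields a value of 1.342 and the estimate of risk probability is $\hat{\pi}(\boldsymbol{x}) = \exp(1.342)/\{1 + \exp(1.342)\} = 0.794$, thus indicating a fairly high probability of the presence of calcium oxalate crystals in the body.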
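The last step of the example, converting the value of the linear combination into a probability, can be checked with a few lines of Python. This sketch starts from the value 1.342 quoted in the text rather than recomputing it from the six measurements, since the coefficients as printed appear to be rounded (an assumption of this sketch, not something the text states), and small rounding differences in the coefficients are magnified by large inputs such as the osmolality.

```python
import math

eta = 1.342  # value of the fitted linear combination quoted in the text

# Logistic transformation: exp(eta) / (1 + exp(eta))
pi_hat = math.exp(eta) / (1.0 + math.exp(eta))

print(round(pi_hat, 2))  # about 0.79, in line with the probability quoted in the text
```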

4.2.2 Model Evaluation and Selection

In constructing a multiple logistic regression model for the modeling of multiple risk factors and risk probabilities, a key question is what risk factors to include in the model to obtain optimum risk prediction. Utilization of the AIC (Akaike information criterion) as defined by (5.31) in Section 5.2.2 provides an answer to this question.

The AIC is generally defined by

$$\mathrm{AIC} = -2(\text{maximum log-likelihood}) + 2(\text{no. of free parameters}).$$

We first replace the parameter vector $\boldsymbol{\beta}$ in the log-likelihood function (4.17) with the maximum likelihood estimator $\hat{\boldsymbol{\beta}}$, to obtain the maximum log-likelihood $\ell(\hat{\boldsymbol{\beta}})$. The number of free parameters in the model is $(p + 1)$, which is the number of multiple risk factors $p$ together with one intercept. Then the AIC for evaluating the logistic regression model estimated by the maximum likelihood method is given by

$$\begin{aligned} \mathrm{AIC} &= -2 \ell(\hat{\boldsymbol{\beta}}) + 2(p + 1) \\ &= -2 \sum_{i=1}^{n} y_i \hat{\boldsymbol{\beta}}^T \boldsymbol{x}_i + 2 \sum_{i=1}^{n} \log\{1 + \exp(\hat{\boldsymbol{\beta}}^T \boldsymbol{x}_i)\} + 2(p + 1). \end{aligned} \tag{4.23}$$

We select as the optimum model the one whose combination of risk factors (predictor variables) yields the smallest AIC value.
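The selection rule can be illustrated by fitting every subset of candidate risk factors and comparing AIC values. The sketch below is one way to do this under stated assumptions: the data are simulated so that only the first of three candidate factors affects the response, the model is fitted by a fixed number of Newton steps without convergence checks, and all names (`fit_logistic`, `aic`, `results`) are hypothetical helpers rather than anything defined in the text. An exhaustive search of this kind is practical only when the number of candidate factors is small.

```python
import itertools
import numpy as np

def fit_logistic(X, y, n_iter=25):
    """Maximum likelihood fit by Newton steps (equivalent to Fisher scoring here)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        pi = 1.0 / (1.0 + np.exp(-(X @ beta)))
        w = pi * (1.0 - pi)
        beta = beta + np.linalg.solve((X.T * w) @ X, X.T @ (y - pi))
    return beta

def aic(X, y, beta):
    """AIC = -2 * maximum log-likelihood + 2 * (number of free parameters)."""
    eta = X @ beta
    loglik = np.sum(y * eta - np.logaddexp(0.0, eta))
    return float(-2.0 * loglik + 2.0 * X.shape[1])

# Simulated data: three candidate factors, only factor 0 influences the response.
rng = np.random.default_rng(1)
n = 300
Z = rng.normal(size=(n, 3))
eta_true = -0.3 + 1.5 * Z[:, 0]
y = (rng.uniform(size=n) < 1.0 / (1.0 + np.exp(-eta_true))).astype(float)

# Fit every subset of the candidate factors (intercept always included).
results = {}
for size in range(Z.shape[1] + 1):
    for subset in itertools.combinations(range(Z.shape[1]), size):
        X = np.column_stack([np.ones(n)] + [Z[:, j] for j in subset])
        results[subset] = aic(X, y, fit_logistic(X, y))

best = min(results, key=results.get)
for subset, value in sorted(results.items(), key=lambda kv: kv[1]):
    print(subset, round(value, 2))
print("selected factors:", best)
```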