Multiple Risk Factor Models - Logistic Regression Models

Logistic Regression Models

4.2 Multiple Risk Factor Models

So far we have considered models for analysis of data representing the presence or absence of response in an individual to a single stimulus type. Many cases exist, however, in which multiple factors interact to induce risk, and so we need to model their interrelationship and predict risk probability on that basis. Typical examples include those in which we wish to (1) construct a model for analysis of how the interaction of blood pressure, cholesterol, and other factors relates to the probability of occurrence of a certain disease; (2) predict the risk in extending a loan or issuing a credit card based on age, income, family structure, years of employment, and other factors; and (3) build a model that will predict corporate risk on the basis of total assets, sales, operating profit, equity ratio, and other factors.

Let the observed data for $p$ predictor variables (risk factors) $x_1, x_2, \cdots, x_p$ and response variable $y$ be written

$$\{(x_{i1}, x_{i2}, \cdots, x_{ip}),\ y_i;\ i = 1, 2, \cdots, n\}, \qquad y_i = \begin{cases} 1 & \text{response} \\ 0 & \text{non-response,} \end{cases} \tag{4.9}$$

where $x_{ij}$ $(i = 1, 2, \cdots, n,\ j = 1, 2, \cdots, p)$ is the data for the $j$-th predictor variable of the $i$-th individual. If response occurs when the $i$-th individual is exposed to stimulus comprising $p$ risk factors, we take the data as $y_i = 1$, and if there is no response, we take it as $y_i = 0$.

As in the previous section, if we use the random variable $Y$ representing response or non-response, then the respective response and non-response probabilities for the multiple risk factors $x_1, \cdots, x_p$ may be expressed as

$$P(Y = 1 \mid x_1, \cdots, x_p) = \pi, \qquad P(Y = 0 \mid x_1, \cdots, x_p) = 1 - \pi. \tag{4.10}$$

The multiple risk factors and the response probabilities are linked, moreover, by

$$\pi = \frac{\exp(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p)}{1 + \exp(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p)} = \frac{\exp(\beta^T x)}{1 + \exp(\beta^T x)}, \tag{4.11}$$

where $\beta = (\beta_0, \beta_1, \cdots, \beta_p)^T$ and $x = (1, x_1, x_2, \cdots, x_p)^T$. This is referred to as a multiple logistic regression model. By logit transformation in the same manner as in (4.4), this becomes

$$\log \frac{\pi}{1 - \pi} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p = \beta^T x, \tag{4.12}$$

which expresses the linear combination of the predictor variables representing the multiple risk factors. In the next section, we discuss the estimation of model parameters by maximum likelihood with the binary response expressed in the form of the probability distribution.
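The relationship between the response probability and its logit can be checked numerically. The following is a minimal Python sketch, not part of the original text; the function names and the coefficient and risk-factor values are hypothetical and chosen only for illustration.

```python
import math

def response_probability(beta, x):
    """Response probability pi of the multiple logistic model.

    beta = (b0, b1, ..., bp) includes the intercept; x = (x1, ..., xp).
    """
    # Linear predictor beta^T x with the leading 1 for the intercept
    eta = beta[0] + sum(b * xj for b, xj in zip(beta[1:], x))
    # Numerically stable evaluation of exp(eta) / (1 + exp(eta))
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    z = math.exp(eta)
    return z / (1.0 + z)

def logit(pi):
    """Logit transformation log(pi / (1 - pi))."""
    return math.log(pi / (1.0 - pi))

beta = [-1.0, 0.8, -0.5]   # hypothetical coefficients for p = 2
x = [2.0, 1.0]             # hypothetical risk-factor values
pi = response_probability(beta, x)
eta_back = logit(pi)       # recovers the linear predictor beta^T x
```

Applying `logit` to the probability returns the linear predictor, which is the sense in which the model is linear on the logit scale.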

4.2.1 Model Estimation

We consider the use of maximum likelihood for parameter estimation in the logistic regression model. The first question is then how to represent, in a probability distribution model, a phenomenon whose outcomes are given in binary form as response or non-response. Let us begin by considering the toss of a coin that comes up heads with probability $\pi$, which may be thought of as a value that can be determined only by an infinite number of trials. We represent the outcome of each trial by the random variable $Y$, which is assigned a value of 1 if the coin comes up heads and 0 if it comes up tails. We may express this as $P(Y = 1) = \pi$, $P(Y = 0) = 1 - \pi$. The random variable $Y$ thus takes a value of either 0 or 1 and is therefore a discrete random variable with a probability distribution

$$f(y \mid \pi) = \pi^{y}(1 - \pi)^{1-y}, \qquad y = 0, 1, \tag{4.13}$$

which is the Bernoulli distribution. We consider modeling based on the data in (4.9) using this distribution.

The observed data for the $i$-th individual consists of the possible values $y_i$ of the binary response random variable $Y_i$ representing response or non-response for the multiple risk factors $(x_{i1}, x_{i2}, \cdots, x_{ip})$. If we take $\pi_i$ as the true response rate for the $i$-th multiple risk factors, the probability distribution of $Y_i$ is then given by

$$f(y_i \mid \pi_i) = \pi_i^{y_i}(1 - \pi_i)^{1-y_i}, \qquad y_i = 0, 1, \quad i = 1, 2, \cdots, n. \tag{4.14}$$

The likelihood function based on $y_1, y_2, \cdots, y_n$ is accordingly

$$L(\pi_1, \pi_2, \cdots, \pi_n) = \prod_{i=1}^{n} f(y_i \mid \pi_i) = \prod_{i=1}^{n} \pi_i^{y_i}(1 - \pi_i)^{1-y_i}. \tag{4.15}$$

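By substituting the logistic regression model (4.11), which links the multiple risk factors that influence the response rate $\pi_i$ of the $i$-th individual, into (4.15), we have the likelihood function of the $(p + 1)$-dimensional parameter vector $\beta$

$$L(\beta) = \prod_{i=1}^{n} \left\{ \frac{\exp(\beta^T x_i)}{1 + \exp(\beta^T x_i)} \right\}^{y_i} \left\{ \frac{1}{1 + \exp(\beta^T x_i)} \right\}^{1-y_i} = \prod_{i=1}^{n} \frac{\exp(y_i \beta^T x_i)}{1 + \exp(\beta^T x_i)}, \tag{4.16}$$

where $x_i = (1, x_{i1}, x_{i2}, \cdots, x_{ip})^T$. Accordingly, the log-likelihood function for the parameter vector $\beta$ of the logistic regression model is

$$\ell(\beta) = \log L(\beta) = \sum_{i=1}^{n} y_i \beta^T x_i - \sum_{i=1}^{n} \log\{1 + \exp(\beta^T x_i)\}. \tag{4.17}$$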
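As an illustration, the log-likelihood can be evaluated in two equivalent ways: as the logarithm of the Bernoulli product (4.15) with the logistic model substituted for each response rate, and in the simplified form that results from that substitution, the sum of y_i β^T x_i minus the sum of log{1 + exp(β^T x_i)}. The sketch below uses NumPy; the function names and the toy data are assumptions made for illustration only.

```python
import numpy as np

def log_likelihood(beta, X, y):
    """Simplified form: sum_i [ y_i * beta^T x_i - log(1 + exp(beta^T x_i)) ]."""
    eta = X @ beta
    # np.logaddexp(0, eta) evaluates log(1 + exp(eta)) without overflow
    return float(np.sum(y * eta - np.logaddexp(0.0, eta)))

def log_likelihood_bernoulli(beta, X, y):
    """Same quantity as the log of the Bernoulli product over individuals."""
    pi = 1.0 / (1.0 + np.exp(-(X @ beta)))
    return float(np.sum(y * np.log(pi) + (1.0 - y) * np.log(1.0 - pi)))

# Hypothetical toy data: a column of ones for the intercept plus one risk factor
X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 2.5], [1.0, 3.5]])
y = np.array([0.0, 0.0, 1.0, 1.0])
beta = np.array([-2.0, 1.0])

ll_simplified = log_likelihood(beta, X, y)
ll_product = log_likelihood_bernoulli(beta, X, y)
```

The two values agree up to floating-point error; the simplified form is usually preferred in practice because it avoids taking the logarithm of probabilities that are very close to 0 or 1.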

The optimization process with respect to the unknown parameter vector $\beta$ is nonlinear, and the likelihood equation does not have an explicit solution. The maximum likelihood estimate, $\hat{\beta}$, in this case may be obtained using a numerical optimization technique, such as the Newton-Raphson method.

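The first and second derivatives of the log-likelihood function $\ell(\beta)$ with respect to $\beta$ are given by

$$\frac{\partial \ell(\beta)}{\partial \beta} = \sum_{i=1}^{n} (y_i - \pi_i) x_i = X^T \Lambda 1_n, \tag{4.18}$$

$$\frac{\partial^2 \ell(\beta)}{\partial \beta \partial \beta^T} = -\sum_{i=1}^{n} \pi_i (1 - \pi_i) x_i x_i^T = -X^T \Pi (I_n - \Pi) X, \tag{4.19}$$

where $X = (x_1, x_2, \cdots, x_n)^T$ is an $n \times (p + 1)$ design matrix, $I_n$ is the $n \times n$ identity matrix, $1_n$ is an $n$-dimensional vector, the elements of which are all 1, and $\Lambda$ and $\Pi$ are $n \times n$ diagonal matrices defined as

$$\Lambda = \mathrm{diag}(y_1 - \pi_1, \cdots, y_n - \pi_n), \qquad \Pi = \mathrm{diag}(\pi_1, \cdots, \pi_n).$$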
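The derivative expressions can be verified numerically. The sketch below, which is an illustration and not taken from the source, computes the gradient as the sum of (y_i − π_i) x_i and the second-derivative matrix as −X^T Π (I_n − Π) X, then compares the gradient with a central finite-difference approximation of the log-likelihood. The function names, the random toy data, and the step size are assumptions.

```python
import numpy as np

def log_likelihood(beta, X, y):
    eta = X @ beta
    return float(np.sum(y * eta - np.logaddexp(0.0, eta)))

def score_and_hessian(beta, X, y):
    """First and second derivatives of the log-likelihood in matrix form."""
    pi = 1.0 / (1.0 + np.exp(-(X @ beta)))
    score = X.T @ (y - pi)          # sum_i (y_i - pi_i) x_i
    w = pi * (1.0 - pi)             # diagonal elements of Pi (I_n - Pi)
    hessian = -(X.T * w) @ X        # -X^T Pi (I_n - Pi) X
    return score, hessian

# Hypothetical data: intercept column plus two risk factors
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(6), rng.normal(size=(6, 2))])
y = np.array([0.0, 1.0, 0.0, 1.0, 1.0, 0.0])
beta = np.array([0.1, -0.3, 0.5])

score, hessian = score_and_hessian(beta, X, y)

# Central finite differences of the log-likelihood, one coordinate at a time
eps = 1e-6
numeric_score = np.array([
    (log_likelihood(beta + eps * e, X, y) - log_likelihood(beta - eps * e, X, y)) / (2 * eps)
    for e in np.eye(len(beta))
])
```

Because every π_i(1 − π_i) is positive, the negative of the second-derivative matrix is positive definite whenever X has full column rank, so the log-likelihood is concave in β.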

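Starting from an initial value, we numerically obtain a solution using the following update formula:

$$\beta^{new} = \beta^{old} + \left\{ -\frac{\partial^2 \ell(\beta^{old})}{\partial \beta \partial \beta^T} \right\}^{-1} \frac{\partial \ell(\beta^{old})}{\partial \beta}. \tag{4.20}$$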

This update formula is referred to as Fisher's scoring algorithm (Nelder and Wedderburn, 1972; Green and Silverman, 1994), and the $(r + 1)$st estimator, $\hat{\beta}^{(r+1)}$, is updated by

$$\hat{\beta}^{(r+1)} = \left\{ X^T \Pi^{(r)} (I_n - \Pi^{(r)}) X \right\}^{-1} X^T \Pi^{(r)} (I_n - \Pi^{(r)}) \xi^{(r)},$$

where $\xi^{(r)} = X \hat{\beta}^{(r)} + \{\Pi^{(r)} (I_n - \Pi^{(r)})\}^{-1} (y - \Pi^{(r)} 1_n)$, $y = (y_1, \cdots, y_n)^T$, and $\Pi^{(r)}$ is an $n \times n$ diagonal matrix having $\pi_i = \exp(\hat{\beta}^{(r)T} x_i) / \{1 + \exp(\hat{\beta}^{(r)T} x_i)\}$ for the $r$th estimator $\hat{\beta}^{(r)}$ in the $i$-th diagonal element. Thus, by substituting the estimator $\hat{\beta}$ determined by the numerical optimization procedure into (4.11), we have the estimated logistic regression model

$$\hat{\pi}(x) = \frac{\exp(\hat{\beta}^T x)}{1 + \exp(\hat{\beta}^T x)}, \tag{4.21}$$

which is used to predict risk probability for the multiple risk factors.
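The iterative update can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions rather than a production routine: the starting value of zero, the stopping tolerance, the function name `fisher_scoring`, and the toy data are all choices made here, and the sketch assumes the data are not perfectly separable so that a finite maximum likelihood estimate exists.

```python
import numpy as np

def fisher_scoring(X, y, max_iter=50, tol=1e-10):
    """Fisher scoring for logistic regression.

    X is the n x (p + 1) design matrix with a leading column of ones.
    """
    beta = np.zeros(X.shape[1])                 # assumed initial value
    for _ in range(max_iter):
        pi = 1.0 / (1.0 + np.exp(-(X @ beta)))
        w = pi * (1.0 - pi)                     # diagonal of Pi (I_n - Pi)
        xi = X @ beta + (y - pi) / w            # working response xi^(r)
        XtW = X.T * w                           # X^T Pi (I_n - Pi)
        beta_new = np.linalg.solve(XtW @ X, XtW @ xi)
        converged = np.max(np.abs(beta_new - beta)) < tol
        beta = beta_new
        if converged:
            break
    return beta

# Hypothetical, non-separable toy data with a single risk factor
x1 = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0])
X = np.column_stack([np.ones_like(x1), x1])
y = np.array([0, 0, 1, 0, 1, 0, 1, 1], dtype=float)

beta_hat = fisher_scoring(X, y)
pi_hat = 1.0 / (1.0 + np.exp(-(X @ beta_hat)))  # estimated risk probabilities
```

At convergence the gradient X^T(y − π̂) is numerically zero, which gives a simple check on the result. Each step is a weighted least squares fit of the working response on X, which is why the procedure is also known as iteratively reweighted least squares.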

Example 4.1 (Probability of the presence of calcium oxalate crystal) The presence of calcium oxalate crystals in the body leads to the formation of kidney or ureteral stones, but detection of their presence requires a thorough medical examination. We consider the construction of a simple method of risk diagnosis with a model comprising six properties of urine that are thought to be factors in crystal formation.

The six-dimensional data that are available (Andrews and Herzberg, 1985, p. 249) are the results of thorough examination in which calcium oxalate crystals were found to be present in the urine of 33 of the 77 individuals and not present in that of 44. The six urine properties that were measured for each individual were specific gravity $x_1$, pH $x_2$, osmolality (mOsm) $x_3$, conductivity $x_4$, urea concentration $x_5$, and calcium concentration (CALC) $x_6$. The label variable $Y$ is represented as $y_i = 1$; $i = 1, 2, \cdots, 33$ for the individuals in which the crystals were present and as $y_i = 0$; $i = 1, 2, \cdots, 44$ for those in which they were absent.

Estimating the logistic regression model in (4.11) by the maximum likelihood method yields

$$\hat{\pi}(x) = \frac{\exp(\hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \hat{\beta}_3 x_3 + \hat{\beta}_4 x_4 + \hat{\beta}_5 x_5 + \hat{\beta}_6 x_6)}{1 + \exp(\hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \hat{\beta}_3 x_3 + \hat{\beta}_4 x_4 + \hat{\beta}_5 x_5 + \hat{\beta}_6 x_6)},$$

where

$$\hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \hat{\beta}_3 x_3 + \hat{\beta}_4 x_4 + \hat{\beta}_5 x_5 + \hat{\beta}_6 x_6 = -355.34 + 355.94 x_1 - 0.5 x_2 + 0.02 x_3 - 0.43 x_4 + 0.03 x_5 + 0.78 x_6. \tag{4.22}$$

If this risk prediction model is applied, for example, to the data $x = (1.017, 5.74, 577, 20, 296, 4.49)$, then the logistic regression model (4.22) yields a value of 1.342 and the estimate of risk probability is $\hat{\pi}(x) = \exp(1.342)/\{1 + \exp(1.342)\} = 0.794$, thus indicating a fairly high probability of the presence of calcium oxalate crystals in the body.
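The last step of the example, converting the value of the linear predictor into a risk probability, can be sketched as follows. The function name is hypothetical; the only input taken from the text is the reported value 1.342.

```python
import math

def risk_probability(eta):
    """Estimated risk probability exp(eta) / (1 + exp(eta)) for linear predictor eta."""
    return 1.0 / (1.0 + math.exp(-eta))

eta = 1.342                      # value of the linear predictor reported in the example
p = risk_probability(eta)        # about 0.79
odds = math.exp(eta)             # estimated odds of crystals being present versus absent
```

The result is about 0.79, in line with the reported 0.794 up to rounding of the linear predictor, and the corresponding odds are a little under 4 to 1.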

4.2.2 Model Evaluation and Selection

In constructing a multiple logistic regression model for the modeling of multiple risk factors and risk probabilities, a key question is what risk factors to include in the model to obtain optimum risk prediction. Utilization of the AIC (Akaike information criterion) as defined by (5.31) in Section 5.2.2 provides an answer to this question.

The AIC is generally deﬁned by

AIC= −2(maximum log-likelihood) + 2(no. of free parameters).

We first replace the parameter vector $\beta$ in the log-likelihood function (4.17) with the maximum likelihood estimator $\hat{\beta}$, to obtain the maximum log-likelihood $\ell(\hat{\beta})$. The number of free parameters in the model is $(p + 1)$, which is the number of multiple risk factors $p$ together with one intercept. Then the AIC for evaluating the logistic regression model estimated by the maximum likelihood method is given by

$$\mathrm{AIC} = -2\ell(\hat{\beta}) + 2(p + 1) = -2 \sum_{i=1}^{n} y_i \hat{\beta}^T x_i + 2 \sum_{i=1}^{n} \log\{1 + \exp(\hat{\beta}^T x_i)\} + 2(p + 1). \tag{4.23}$$

We select as the optimum model the one whose combination of risk factors (predictor variables) yields the smallest AIC value.
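One way to carry out this selection is to fit every subset of the candidate risk factors and compare AIC values. The sketch below is an illustration only: exhaustive search is an assumption made here (the text does not prescribe a search strategy, and it is feasible only for small p), the data are simulated with a fixed seed so that only the first factor truly affects the response, and all function names are hypothetical.

```python
import itertools
import numpy as np

def fit_logistic(X, y, max_iter=100, tol=1e-10):
    """Maximum likelihood estimate by Newton-type updates, starting from zero."""
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        pi = 1.0 / (1.0 + np.exp(-(X @ beta)))
        w = pi * (1.0 - pi)
        step = np.linalg.solve((X.T * w) @ X, X.T @ (y - pi))
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

def aic(X, y):
    """AIC = -2 * (maximum log-likelihood) + 2 * (number of free parameters)."""
    eta = X @ fit_logistic(X, y)
    max_loglik = np.sum(y * eta - np.logaddexp(0.0, eta))
    return float(-2.0 * max_loglik + 2.0 * X.shape[1])

def best_subset(Z, y):
    """Return (smallest AIC, tuple of selected column indices of Z)."""
    n, p = Z.shape
    results = []
    for k in range(p + 1):
        for subset in itertools.combinations(range(p), k):
            # Every candidate model keeps the intercept column
            X = np.column_stack([np.ones(n)] + [Z[:, j] for j in subset])
            results.append((aic(X, y), subset))
    return min(results)

# Simulated data (assumption): three candidate factors, only factor 0 matters
rng = np.random.default_rng(1)
n = 200
Z = rng.normal(size=(n, 3))
true_pi = 1.0 / (1.0 + np.exp(-(-0.5 + 1.5 * Z[:, 0])))
y = (rng.random(n) < true_pi).astype(float)

best_aic, best_vars = best_subset(Z, y)
```

Here `best_vars` holds the column indices of the selected risk factors. With p candidate factors the search fits 2^p models, so for larger p a stepwise search is a common substitute.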

Source: Sadanori Konishi, Introduction to Multivariate Analysis: Linear and Nonlinear Modeling, pages 121-125.