Risk Prediction Models - Logistic Regression Models

Logistic Regression Models

4.1 Risk Prediction Models

Let us first consider how to construct a model for a single type of stimulus: by analyzing whether individuals responded at various stimulus levels, we aim to predict the risk of response at given levels.

4.1.1 Modeling for Proportional Data

When the response to different stimulus levels can be observed in many individuals and expressed as the proportion of individuals responding, it is desirable to build a model of the relationship between the stimulus level and the response rate, and to use it to predict the risk of response at future stimulus levels. Consider a case in which aphids on the underside of plant leaves were exposed to five insecticides differing in their concentration of a chemical with an insecticidal effect; we seek a model that expresses this insecticidal effect.

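Table 4.1 Stimulus levels and the proportion of individuals that responded.

| Stimulus level (x) | 0.4150 | 0.5797 | 0.7076 | 0.8865 | 1.0086 |
|--------------------|--------|--------|--------|--------|--------|
| No. of individuals | 50     | 48     | 46     | 49     | 50     |
| No. responding     | 6      | 16     | 24     | 42     | 44     |
| Response rate (y)  | 0.120  | 0.333  | 0.522  | 0.857  | 0.880  |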

Table 4.1 shows the number and proportion (y) of individuals that died among approximately 50 aphids at each of five incrementally increasing concentrations of the insecticide, taken as the stimulus levels (x), following its application to the aphids (Chatterjee et al., 1999). We will construct a model to ascertain how the response rate (lethality) changes as the stimulus level increases.
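The response rates in Table 4.1 are the number responding divided by the number exposed at each level. A minimal Python sketch of that arithmetic follows; the variable names are illustrative and do not come from the source.

```python
# Data transcribed from Table 4.1.
stimulus = [0.4150, 0.5797, 0.7076, 0.8865, 1.0086]   # stimulus level x
exposed = [50, 48, 46, 49, 50]                         # no. of individuals
responded = [6, 16, 24, 42, 44]                        # no. responding

# Response rate y = responded / exposed at each stimulus level.
rates = [r / n for r, n in zip(responded, exposed)]

for x, y in zip(stimulus, rates):
    print(f"x = {x:.4f}  y = {y:.3f}")
```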

In the framework of a regression model, the stimulus level x would be taken as the predictor variable and the rate y as the response variable.

The present setting differs from that of an ordinary regression model in that, although the predictor variable x may take any real value, the response variable y is limited to the range 0 < y < 1 because it represents a proportion. Figure 4.1 plots the graduated stimulus levels of Table 4.1 along the x axis and the response rates along the y axis. Our objective is to construct from these data a model that outputs y values in the range 0 to 1 for input x values, as

$$ y = f(x), \qquad 0 < y < 1, \quad -\infty < x < +\infty. \tag{4.1} $$

As neither a linear model nor a polynomial model can provide a fitted model that outputs risk probability values limited to the range 0 < y < 1, it is necessary to consider a new type of model.
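To see why a straight line is unsuitable here, the following sketch fits an ordinary least-squares line to the response rates of Table 4.1 and evaluates it slightly outside the observed range. The evaluation points 0.2 and 1.2 are hypothetical stimulus levels chosen for illustration, and the helper name `linear_prediction` is not from the source.

```python
# Table 4.1: stimulus levels and observed response rates.
x = [0.4150, 0.5797, 0.7076, 0.8865, 1.0086]
y = [0.120, 0.333, 0.522, 0.857, 0.880]

# Ordinary least-squares fit of the straight line y = a + b * x.
n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
b = s_xy / s_xx
a = y_bar - b * x_bar

def linear_prediction(x_new):
    """Predicted response rate from the fitted straight line."""
    return a + b * x_new

# Hypothetical stimulus levels just outside the observed range.
low = linear_prediction(0.2)
high = linear_prediction(1.2)
print(low, high)
```

At these two levels the fitted line returns a value below 0 and a value above 1, neither of which can be read as a probability.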

In general, a model that outputs values of y in the interval (0, 1) for real-valued inputs x, here representing the response rate (y) at the stimulus level (x), is the logistic regression model of the form

$$ y = \frac{\exp(\beta_0 + \beta_1 x)}{1 + \exp(\beta_0 + \beta_1 x)}, \qquad 0 < y < 1, \quad -\infty < x < +\infty. \tag{4.2} $$

The logistic regression model is a monotone function that traces one of two curves, as illustrated in Figure 4.2, depending on whether the coefficient β1 of the predictor variable x is positive or negative.
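The dependence of the curve's direction on the sign of β1 can be checked numerically. The sketch below evaluates (4.2) on a grid; the coefficient values (β0 = 0, β1 = ±1) and the grid are hypothetical choices made only to show the two shapes.

```python
import math

def logistic(x, beta0, beta1):
    """Logistic model (4.2): exp(eta) / (1 + exp(eta)) with eta = beta0 + beta1 * x."""
    eta = beta0 + beta1 * x
    return math.exp(eta) / (1.0 + math.exp(eta))

# Hypothetical coefficients chosen only to show the two shapes.
grid = [i / 2 for i in range(-10, 11)]            # x from -5 to 5
rising = [logistic(x, 0.0, 1.0) for x in grid]    # beta1 > 0
falling = [logistic(x, 0.0, -1.0) for x in grid]  # beta1 < 0

is_increasing = all(u < v for u, v in zip(rising, rising[1:]))
is_decreasing = all(u > v for u, v in zip(falling, falling[1:]))
inside_unit_interval = all(0.0 < v < 1.0 for v in rising + falling)
print(is_increasing, is_decreasing, inside_unit_interval)
```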

Figure 4.1 Plot of the graduated stimulus levels shown in Table 4.1 along the x axis and the response rate along the y axis.

It may alternatively take the form

$$ y = \frac{1}{1 + \exp\{-(\beta_0 + \beta_1 x)\}}, \tag{4.3} $$

which is obtained by multiplying both the numerator and the denominator in (4.2) by exp{−(β0 + β1x)}. If y is then converted by what is called the logit transformation, as follows, the function becomes linear in x:

$$ \log \frac{y}{1 - y} = \beta_0 + \beta_1 x. \tag{4.4} $$
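The step from (4.2) to (4.4) can be written out explicitly. Here η = β0 + β1x is shorthand introduced only for this derivation and is not notation from the source.

```latex
\begin{align*}
1 - y &= 1 - \frac{\exp(\eta)}{1 + \exp(\eta)} = \frac{1}{1 + \exp(\eta)}, \\
\frac{y}{1 - y} &= \exp(\eta), \\
\log \frac{y}{1 - y} &= \eta = \beta_0 + \beta_1 x .
\end{align*}
```

The ratio y/(1 − y) is the odds of response, so (4.4) says that the log odds are linear in x.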

By fitting the logistic regression model to the observed data shown in Table 4.1 for the relation between the stimulus level x and the response rate y, we obtain

$$ y = \frac{\exp(-4.85 + 7.167x)}{1 + \exp(-4.85 + 7.167x)} \tag{4.5} $$

(see Figure 4.3). The parameters of this model were estimated by maximum likelihood, which is described in Section 4.2. As shown in the figure, the estimated logistic regression model (4.5) yields 0.677 as the stimulus level resulting in the death of 50% of the individuals and 1.1 as that resulting in the death of 95%.
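The two calculations in this paragraph can be sketched in Python. The first part maximizes the binomial log-likelihood for the Table 4.1 counts by Newton-Raphson; this is one standard way to compute a maximum likelihood fit and is an assumption here, not necessarily the procedure the author used, so the estimates should land close to, but need not equal, the coefficients in (4.5). The second part inverts the logit relation (4.4) to find the stimulus level for a target response probability. The function names `fit_newton` and `level_for` are illustrative.

```python
import math

# Table 4.1: stimulus level, number exposed, number responding.
x = [0.4150, 0.5797, 0.7076, 0.8865, 1.0086]
n = [50, 48, 46, 49, 50]
r = [6, 16, 24, 42, 44]

def prob(xi, b0, b1):
    """Response probability under the logistic model, in the form (4.3)."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))

def fit_newton(iterations=25):
    """Maximize the binomial log-likelihood by Newton-Raphson from (0, 0)."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        p = [prob(xi, b0, b1) for xi in x]
        # Score vector: gradient of the log-likelihood.
        g0 = sum(ri - ni * pi for ri, ni, pi in zip(r, n, p))
        g1 = sum((ri - ni * pi) * xi for ri, ni, pi, xi in zip(r, n, p, x))
        # Information matrix entries with weights w = n * p * (1 - p).
        w = [ni * pi * (1.0 - pi) for ni, pi in zip(n, p)]
        a00 = sum(w)
        a01 = sum(wi * xi for wi, xi in zip(w, x))
        a11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = a00 * a11 - a01 * a01
        # Newton step: solve the 2x2 system (information) * step = score.
        b0 += (a11 * g0 - a01 * g1) / det
        b1 += (a00 * g1 - a01 * g0) / det
    return b0, b1

def level_for(p_target, b0, b1):
    """Invert (4.4): stimulus level at which the response probability is p_target."""
    return (math.log(p_target / (1.0 - p_target)) - b0) / b1

b0_hat, b1_hat = fit_newton()
print(b0_hat, b1_hat)

# Levels implied by the coefficients printed in (4.5).
ld50 = level_for(0.50, -4.85, 7.167)
ld95 = level_for(0.95, -4.85, 7.167)
print(ld50, ld95)
```

With the coefficients of (4.5), the 50% level evaluates to about 0.677 and the 95% level to about 1.09, in line with the values 0.677 and 1.1 quoted in the text.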

Figure 4.2 Logistic functions.

Figure 4.3 Fitting the logistic regression model to the observed data shown in Table 4.1 for the relation between the stimulus level x and the response rate y.

4.1.2 Binary Response Data

In the previous section, we fitted a logistic regression model to data in which many individuals were observed at each of several stimulus levels, so that the proportion responding at each level could be computed. We next address how to proceed when such a proportion cannot be obtained. For example, in modeling the relation between the level of a certain component in the blood and the incidence of a disease, it is in many cases impossible to find large groups of people with the same blood level and investigate the proportion in which the disease has occurred. What is available instead is only binary data (i.e., 0 or 1) representing the occurrence or non-occurrence of the disease (the response) at particular laboratory test values (stimulus levels). When the predictor variable x takes real values in this way and the observed response y is binary (response or non-response), the modeling task becomes one of binary response data analysis. Let us consider the following example as a basis for discussing risk modeling from binary data observed at various stimulus levels.

In this example, we will construct a model to predict the risk of the presence of calcium oxalate crystals in the body from the specific gravity of urine, which can be readily determined by medical examination. Such crystals are a cause of kidney and ureteral stone formation. The data used here are measured values of urine specific gravity taken from Andrews and Herzberg (1985, p. 249), who report measurements from six tests of urine factors thought to cause crystal formation, obtained in clinical examinations and yielding 77 observations. A value of 0 was assigned if no calcium oxalate crystals were found in the urine, and a value of 1 if they were found. In Section 4.2.1 below, multiple-risk model analysis is applied to the calcium oxalate crystal data obtained from the six tests.

In Figure 4.4, the data on the presence and non-presence of the crystals are plotted along the vertical axis as y = 0 for the 44 individuals in whom they were not present and y = 1 for the 33 in whom they were present, and the corresponding urine specific gravity values are plotted along the x axis. The figure suggests that the probability of calcium oxalate crystal presence tends to increase with increasing urine specific gravity. If we can model the relation between the specific gravity value and the probability of crystal presence from these {0, 1} binary data, then it may become possible to predict the risk probability for individuals, for example a crystal presence probability of 0.82 for an individual whose urine has a specific gravity of 1.026.

Figure 4.4 The data on presence and non-presence of the crystals are plotted along the vertical axis as y = 0 for the 44 individuals exhibiting their non-presence and y = 1 for the 33 exhibiting their presence. The x axis takes the values of their urine specific gravity.

In general, if we assign the values y = 1 and y = 0 to individuals showing response and non-response, respectively, at stimulus level x, we then have the following n sets of data with binary {0, 1} responses.

$$ (x_1, y_1), (x_2, y_2), \cdots, (x_n, y_n), \qquad y_i = \begin{cases} 1 & \text{response} \\ 0 & \text{non-response.} \end{cases} \tag{4.6} $$

We will construct a model on the basis of this type of binary response data for estimation of the response probability π at a given level x.

For the random variable Y representing whether a response has occurred, the response probability at stimulus level x may then be expressed as P(Y = 1|x) = π, and the non-response probability as P(Y = 0|x) = 1 − π.

As the model linking the factor x causing this response and the response probability π, let us assume the logistic regression model

$$ \pi = \frac{\exp(\beta_0 + \beta_1 x)}{1 + \exp(\beta_0 + \beta_1 x)}, \qquad 0 < \pi < 1, \quad -\infty < x < +\infty. \tag{4.7} $$
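One way to make the link between the model (4.7) and binary data of the form (4.6) concrete is to simulate from it: at each level x, a response y = 1 is drawn with probability π and y = 0 otherwise. The sketch below does this with hypothetical coefficients and stimulus levels; none of these numbers come from the source, and the function names are illustrative.

```python
import math
import random

def response_probability(x, beta0, beta1):
    """Logistic model (4.7): pi = exp(eta) / (1 + exp(eta)), eta = beta0 + beta1 * x."""
    eta = beta0 + beta1 * x
    return math.exp(eta) / (1.0 + math.exp(eta))

def simulate_binary_responses(levels, beta0, beta1, seed=0):
    """Draw y_i in {0, 1} with P(Y = 1 | x_i) = pi_i, giving data shaped like (4.6)."""
    rng = random.Random(seed)
    data = []
    for x in levels:
        pi = response_probability(x, beta0, beta1)
        y = 1 if rng.random() < pi else 0
        data.append((x, y))
    return data

# Hypothetical coefficients and stimulus levels, for illustration only.
levels = [0.2, 0.5, 0.8, 1.1, 1.4]
sample = simulate_binary_responses(levels, beta0=-4.0, beta1=5.0)
print(sample)
```

Each simulated pair carries only a 0 or 1 at its level x, which is why the response probability has to be estimated through a model rather than read off as a proportion.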

Figure 4.5 The fitted logistic regression model for the 77 sets of data expressing observed urine specific gravity and presence or non-presence of calcium oxalate crystals.

By estimating the model from the observed data (4.6), we obtain a model that outputs values in the interval (0, 1) for given levels x and can thus be applied to risk prediction. For the 77 sets of data shown in Figure 4.4 expressing observed urine specific gravity and the presence or non-presence of calcium oxalate crystals, applying the logistic regression model with the maximum likelihood method described in Section 4.2.1 yields

$$ \hat{\pi} = \frac{\exp(-142.57 + 139.71x)}{1 + \exp(-142.57 + 139.71x)}, \qquad 0 < \hat{\pi} < 1. \tag{4.8} $$

The fitted logistic regression model is represented by the curve shown in Figure 4.5. The probability of the presence of calcium oxalate crystals in the body increases with increasing urine specific gravity. This risk prediction model can be applied to probabilistic risk assessment for individuals.
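As a sketch of how the fitted model (4.8) might be used for individual risk assessment, the following Python function evaluates it at a given urine specific gravity. The function name `crystal_risk` and the specific gravity values in the loop are hypothetical; only the two coefficients are taken from (4.8).

```python
import math

def crystal_risk(specific_gravity, beta0=-142.57, beta1=139.71):
    """Fitted model (4.8), written in the equivalent form 1 / (1 + exp(-eta))."""
    eta = beta0 + beta1 * specific_gravity
    return 1.0 / (1.0 + math.exp(-eta))

# Hypothetical specific gravity values, chosen only for illustration.
for sg in (1.005, 1.015, 1.020, 1.026, 1.035):
    print(f"specific gravity {sg:.3f}: estimated risk {crystal_risk(sg):.3f}")
```

Because the intercept and slope in (4.8) are large and nearly cancel, the result is sensitive to rounding of the coefficients. With the values as printed, the estimate at a specific gravity of 1.026 comes out near 0.68, so it need not reproduce the illustrative 0.82 mentioned earlier in the section.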

Source: Sadanori Konishi, Introduction to Multivariate Analysis: Linear and Nonlinear Modeling, pp. 114-121.