Regression Analysis

6.3 MULTIPLE LINEAR REGRESSION

6.3.3 Regression with Indicator Variables

Indicator variables are nominal descriptors (see Chapter 1, Section 1.4.1) which can take one of a limited number of values, usually two. They are used to distinguish between different classes of members of a data set. This situation most commonly arises due to the presence or absence of specific chemical features; for example, an indicator variable might distinguish whether or not compounds contain a hydroxyl group, or have a meta substitution. An indicator variable may be used to combine two data sets which are based on different parent structures. Clearly, the dependent data for the different sets should be from the same source, otherwise there would be little point in combining them, and there should be some common physicochemical descriptors (but see later in this section, Free–Wilson method). Indicator variables are treated in multiple regression just as any other variable with regression coefficients computed by least squares. An example of this can be seen in the correlation of reverse phase HPLC capacity factors and calculated octanol/water partition coefficients for the xanthene and thioxanthene derivatives shown in Figure 6.10 [14].

The correlation is given by Equation (6.27), in which the term D was used to indicate the presence (D = 1) or absence (D = 0) of the –NHCON(NO)– group, in other words series I or series II in Figure 6.10.

$$\log P = 0.813(\pm 0.027)\,\log k_w + 2.114(\pm 0.161)\,D \qquad (6.27)$$

Figure 6.10 Parent structures for the compounds described by Equations (6.27) and (6.28).

Examination of the log k_w values showed that the replacement of oxygen by sulphur did not produce the expected increase in lipophilicity, and it was found that a second indicator variable, S, to show the presence or absence of sulphur could be added to the equation to give:

$$\log P = 0.768(\pm 0.021)\,\log k_w + 2.115(\pm 0.115)\,D + 0.415(\pm 0.095)\,S \qquad (6.28)$$

n = 24, r = 0.985, s = 0.260

The correlation coefficient for Equation (6.28) is slightly improved over that for Equation (6.27) (but see Section 6.4.3), the standard error has been reduced, and the regression coefficients for the log k_w and D terms are more or less the same. This demonstrates that the second indicator variable is explaining a different part of the variance in the log P values. It may have been noticed that Equations (6.27) and (6.28) do not contain intercept terms: this is because the intercepts are not significantly different from zero.

These examples show how indicator variables can be used to improve the fit of regression models, but do the indicator variables (actually their regression coefficients) have any physicochemical meaning? The answer to this question is a rather unsatisfactory ‘yes and no’. The sign of the regression coefficient of an indicator variable shows the direction (to reduce or enhance) of the effect of a particular chemical feature on the dependent variable, while the size of the coefficient gives the magnitude of the effect. This does not necessarily bear any relationship to any particular physicochemical property; indeed, it may be a mathematical artefact as described later. On the other hand, it may be possible to ascribe some meaning to indicator variable regression coefficients. The log P values used in Equations (6.27) and (6.28) were calculated by the Rekker fragmental method (see Section 9.2.1 and Table 9.2). This procedure relies on the use of fragment values for particular chemical groups, and the –NHCON(NO)– group, accounted for by the indicator D, was missing from the scheme. The regression coefficient for this indicator variable has a fairly constant value, 2.114 in Equation (6.27) and 2.115 in Equation (6.28), suggesting that this might be a reasonable estimate for the fragment contribution of this group. Measurement of log P values for two compounds in set I allowed an estimate of −2.09(±0.14) to be made for this fragment, in good agreement with the magnitude of the regression coefficient of D. At first sight this statement may seem surprising since the signs of the fragment value and regression coefficients are different. The calculated log P values used in the equations did not take account of the hydrophilic (negative contribution) nitrosureido fragment and thus are larger, by about 2.11, than the values implied by the experimentally determined HPLC capacity factors.
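The sign reversal can be made explicit with a short worked relation. This is a sketch of the reasoning rather than a derivation given in the source; the symbols f (the missing fragment value), a (the slope on log k_w) and the subscripts "calc" and "true" are notation assumed here for illustration.

$$\log P_{\text{true}} = \log P_{\text{calc}} + f\,D, \qquad f \approx -2.09$$

$$\log P_{\text{true}} \approx a\,\log k_w \;\;\Longrightarrow\;\; \log P_{\text{calc}} \approx a\,\log k_w - f\,D \approx a\,\log k_w + 2.09\,D$$

Under these assumptions the coefficient of D in the fitted equations estimates −f, which is why a fragment value of about −2.09 appears as a regression coefficient of about +2.11.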

How does an indicator variable serve to merge two sets of data? The effect is difficult to visualize in multiple dimensions but can be seen in two dimensions in Figure 6.11.

Here, the two lines represent the fit of separate linear regression models; for multiple linear regression these would be surfaces. If the indicator variable has a value of zero for the compounds in set A it will have no effect on the regression line, whatever the value of the fitted regression coefficient. For the compounds in set B, however, the indicator variable has the effect of adding a constant to all the log 1/C values (1 × the regression coefficient of the indicator variable). This results in a displacement of the regression line for the B subset of compounds so that it merges with the line for the A subset.
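The displacement described above can be sketched numerically. The following is a minimal illustration, not taken from the source: the data are synthetic and noise-free, and the slope (0.8), intercept (0.3) and offset between the subsets (1.5) are assumed values chosen only so that the result is easy to check. NumPy's `lstsq` is used as the least-squares solver.

```python
import numpy as np

# Hypothetical, noise-free data: subsets A and B share a slope of 0.8,
# but every value in set B is displaced upwards by a constant 1.5.
x_a = np.array([0.5, 1.0, 1.5, 2.0, 2.5])
x_b = np.array([0.7, 1.2, 1.9, 2.4, 3.0])
y_a = 0.8 * x_a + 0.3
y_b = 0.8 * x_b + 0.3 + 1.5

# Pool the two subsets and flag membership of set B with an indicator.
x = np.concatenate([x_a, x_b])
y = np.concatenate([y_a, y_b])
indicator = np.concatenate([np.zeros_like(x_a), np.ones_like(x_b)])

# Design matrix columns: intercept, descriptor, indicator.
design = np.column_stack([np.ones_like(x), x, indicator])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
intercept, slope, shift = coef

print(f"intercept={intercept:.3f} slope={slope:.3f} shift={shift:.3f}")
```

Because the two lines are exactly parallel in this constructed case, the indicator coefficient recovers the offset between them and the common slope is unchanged.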

Figure 6.11 Illustration of two subsets of compounds with different (parallel) fitted lines.

An indicator variable can be very useful in combining two subsets of compounds in this way since it allows the creation of a larger set which may lead to more reliable predictions. It is also useful to be able to describe the activity of compounds which are operating by a similar mechanism but which have some easily identified chemical differences. However, the situation portrayed in Figure 6.11 is ideal in that the two regression lines are of identical slope and the indicator variable simply serves to displace them. If the lines were of different slopes the indicator may still merge them to produce an apparently good fit to the larger set, but in this case the fitted line would not correspond to a ‘correct’ fit for either of the two subsets. This situation is easy to see for a simple two-dimensional case but would clearly be difficult to identify for multiple linear regression. A way to ensure that an indicator variable is not producing a spurious, apparently good, fit is to model the two subsets separately and then compare these equations with the equation using the indicator. The situation can become even more complicated when two or more indicator variables are used in multiple regression equations; great care should be taken in the interpretation of such models.
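The suggested check, fitting the subsets separately and comparing them with the indicator model, might look like the sketch below. The data are again synthetic and hypothetical (slopes of 0.5 and 1.2 are assumed for the two subsets), and `fit_line` is a helper introduced here purely for illustration.

```python
import numpy as np

def fit_line(x, y):
    """Least-squares (intercept, slope) for a single descriptor."""
    design = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef

# Hypothetical subsets with clearly different slopes.
x_a = np.array([0.5, 1.0, 1.5, 2.0, 2.5])
x_b = np.array([0.5, 1.0, 1.5, 2.0, 2.5])
y_a = 0.5 * x_a + 0.2
y_b = 1.2 * x_b + 1.0

# Separate models, one per subset.
_, slope_a = fit_line(x_a, y_a)
_, slope_b = fit_line(x_b, y_b)

# Combined model with an indicator variable for subset B.
x = np.concatenate([x_a, x_b])
y = np.concatenate([y_a, y_b])
d = np.concatenate([np.zeros_like(x_a), np.ones_like(x_b)])
design = np.column_stack([np.ones_like(x), x, d])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
pooled_slope = coef[1]
residual_ss = float(np.sum((y - design @ coef) ** 2))

print(f"slope A={slope_a:.2f}, slope B={slope_b:.2f}, pooled={pooled_slope:.2f}")
print(f"residual sum of squares of the indicator model: {residual_ss:.4f}")
```

In this constructed case the pooled slope falls between the two subset slopes and matches neither, which is the warning sign the comparison is meant to expose.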

An interesting technique which dates from the early days of modern QSAR, known as the Free and Wilson method [15], represents an extreme case of the use of indicator variables, since regression equations are generated which contain no physicochemical parameters. This technique relies on the following assumptions.

1. There is a constant contribution to activity from the parent structure.

2. Substituents on the parent make a constant contribution (positive or negative) to activity and this is additive.

3. There are no interaction effects between substituents, nor between substituents and the parent.
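Taken together, the three assumptions amount to a purely additive model. One common way of writing it is sketched below; apart from μ0, which the source uses for the parent contribution, the symbols (A_k, a_ij, X_ijk) are notation assumed here for illustration.

$$A_k = \mu_0 + \sum_{i}\sum_{j} a_{ij}\,X_{ijk}$$

Here A_k is the measured activity of compound k, a_ij is the contribution of substituent j at position i, and X_ijk is 1 if compound k carries that substituent at that position and 0 otherwise. For a compound with a 4–CH3 group at R1 and a 4–OCH3 group at R2, for instance, the model reduces to A = μ0 + a(R1, 4–CH3) + a(R2, 4–OCH3).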

Of these assumptions, 1 is perhaps the most reasonable and 3 the most unlikely. After all, it is the interaction of substituents with the electronic structure of the parent that gives rise to Hammett σ constants (see Chapter 10). However, despite any misgivings concerning the assumptions,⁶ this method has the attractive feature that it is not necessary to measure or calculate any physicochemical properties; all that is required are measurements of some dependent variable. The technique operates by the generation of a data table consisting of zeroes and ones. An example of such a data set is given in Table 6.7 for six compounds based on the parent structure shown in Figure 6.12.

⁶ The first two assumptions are implicit, although often not stated, in many other QSAR/QSPR methods. The third assumption may be accounted for to some extent by the deliberate inclusion of several examples of each substituent.

Table 6.7 Free–Wilson data table (reproduced from ref. [16] with permission of the Collection of Czechoslovak Chemical Communications).

| Compound | R1: H | R1: 4–CH3 | R1: 4–Cl | R1: 3–Br | R2: H | R2: 4–CH3 | R2: 4–OCH3 |
|----------|-------|-----------|----------|----------|-------|-----------|------------|
| XI       | 1     | 0         | 0        | 0        | 1     | 0         | 0          |
| XII      | 1     | 0         | 0        | 0        | 0     | 1         | 0          |
| XIII     | 1     | 0         | 0        | 0        | 0     | 0         | 1          |
| XIX      | 0     | 1         | 0        | 0        | 0     | 0         | 1          |
| XXX      | 0     | 0         | 1        | 0        | 0     | 1         | 0          |
| XXXV     | 0     | 0         | 0        | 1        | 1     | 0         | 0          |

A Free–Wilson table will also contain a column or columns of dependent (measured) data; for the example shown in Table 6.7 results were given for minimum inhibitory concentration (MIC) against two bacteria, Mycobacterium tuberculosis and Mycobacterium kansasii. Each column in a Free–Wilson data table, corresponding to a particular substituent at a particular position, is treated as an independent variable. A multiple regression equation is calculated in the usual way between the dependent variable and the independent variables with the regression statistics indicating goodness of fit. The regression coefficients for the independent variables represent the contribution to activity of that substituent at that position, as shown in Table 6.8. In this table, for example, it can be seen that replacement of hydrogen with a methyl substituent (R1) results in a reduction in activity (increase in MIC) against both bacteria.
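The construction of the zero/one table can be sketched in a few lines. The substituent lists and compound assignments below are read off the rows of Table 6.7; the names `R1_LEVELS`, `R2_LEVELS` and `free_wilson_row` are introduced here for illustration only. The measured MIC values are not reproduced in this section, so the sketch stops at the independent variables.

```python
# Substituents at each position, in the column order of Table 6.7.
R1_LEVELS = ["H", "4-CH3", "4-Cl", "3-Br"]
R2_LEVELS = ["H", "4-CH3", "4-OCH3"]

# (compound, R1 substituent, R2 substituent), read off Table 6.7.
COMPOUNDS = [
    ("XI", "H", "H"),
    ("XII", "H", "4-CH3"),
    ("XIII", "H", "4-OCH3"),
    ("XIX", "4-CH3", "4-OCH3"),
    ("XXX", "4-Cl", "4-CH3"),
    ("XXXV", "3-Br", "H"),
]

def free_wilson_row(r1, r2):
    """Return the 0/1 indicator row for one compound."""
    r1_part = [int(r1 == level) for level in R1_LEVELS]
    r2_part = [int(r2 == level) for level in R2_LEVELS]
    return r1_part + r2_part

table = {name: free_wilson_row(r1, r2) for name, r1, r2 in COMPOUNDS}

for name, row in table.items():
    print(f"{name:5s}", *row)
```

Each row contains exactly one 1 per substituent position. With a column of measured activities added, the rows would serve as the independent variables in an ordinary multiple regression.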

Figure 6.12 Parent structure for the compounds given in Table 6.7 (reproduced from ref. [16] with permission of the Collection of Czechoslovak Chemical Communications).

Table 6.8 Activity contributions for substituents as determined by the Free–Wilson technique (reproduced from ref. [16] with permission of the Collection of Czechoslovak Chemical Communications).

| Substituent | MIC against M. kansasii ᵃ | MIC against M. tuberculosis ᵇ |
|-------------|---------------------------|-------------------------------|
| 4–H         | −0.397                    | −0.116                        |
| 4–CH3       | 0.264                     | 0.101                         |
| 4–OCH3      | 0.290                     | 0.337                         |
| 4–Cl        | 0.095                     | −0.101                        |
| 3–Br        | −0.253                    | −0.312                        |
| 4–H         | −0.078                    | 0.088                         |
| 4–CH3       | 0.260                     | 0.303                         |
| 4–OCH3      | −0.081                    | 0.085                         |
| 4–Cl        | 0.403                     | 0.303                         |
| 3,4–Cl2     | −0.259                    | −0.586                        |
| 4–C–C6H11   | −0.589                    | −0.399                        |
| 4–Br        | 0.345                     | 0.205                         |
| μ0 ᶜ        | 1.871                     | 1.887                         |

ᵃ Fit statistics: r = 0.774, s = 0.43, F = 3.59, n = 35.
ᵇ r = 0.745, s = 0.42, F = 3.01, n = 35.
ᶜ μ0 is the (constant) contribution of the parent structure to MIC.

One of the disadvantages of the Free–Wilson method is that, unlike regression equations based on physicochemical parameters, it cannot be used to make predictions for substituents not included in the original analysis. The technique may break down when there are linear dependencies between the structural descriptors, for example, when two substituents at two positions always occur together, or where interactions between substituents occur. Advantages of the technique include its ability to handle data sets with a small number of substituents at a large number of positions, a situation not well handled by other analytical methods, and its ability to describe quite unusual substituents since it does not require substituent constant data. A number of variations and improvements have been made to the original Free and Wilson method; these, and applications of the technique, are discussed in a review by Kubinyi [17].
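A linear dependency of the kind mentioned above can be detected before any regression is attempted. The sketch below uses a small hypothetical indicator table, not data from the source, in which a chlorine at R1 and a bromine at R2 always occur together. Treating hydrogen as an unlisted reference level at each position is an assumption made here to keep the example small.

```python
import numpy as np

# Hypothetical Free-Wilson style design matrix (not from the source).
# Hydrogen is the unlisted reference at each position (an assumption).
columns = ["const", "R1_CH3", "R1_Cl", "R2_OCH3", "R2_Br"]
design = np.array([
    [1, 0, 0, 0, 0],
    [1, 1, 0, 0, 0],
    [1, 0, 0, 1, 0],
    [1, 1, 0, 1, 0],
    [1, 0, 1, 0, 1],   # R1 = Cl only ever appears with R2 = Br
    [1, 0, 1, 0, 1],
], dtype=float)

# A rank lower than the number of columns means the coefficients
# cannot all be determined uniquely by least squares.
rank = np.linalg.matrix_rank(design)
is_rank_deficient = rank < design.shape[1]

# Pairs of columns that are identical across all compounds.
duplicates = [
    (columns[i], columns[j])
    for i in range(len(columns))
    for j in range(i + 1, len(columns))
    if np.array_equal(design[:, i], design[:, j])
]

print(f"rank {rank} of {design.shape[1]} columns; duplicates: {duplicates}")
```

When the two columns are indistinguishable, only the sum of their contributions can be estimated, so neither substituent can be assigned a contribution of its own.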

6.4 MULTIPLE REGRESSION: ROBUSTNESS,

In document Livingstone, Data Analysis (Page 188-193)