WANG, JIUZHOU Tree-structured Classification for Multivariate Binary Responses.
(Un-der the direction of Professor Leonard A. Stefanski and Professor Marc G. Genton).
by
JIUZHOU WANG
A dissertation submitted to the Graduate Faculty of
North Carolina State University
in partial satisfaction of the
requirements for the Degree of
Doctor of Philosophy
STATISTICS
Raleigh
2002
Approved By:
Professor Leonard A. Stefanski
Professor Marc G. Genton
Chair of Advisory Committee
Co-Chair of Advisory Committee
To Meimei
and
Biography
Jiuzhou Wang was born to Fengxiang Wang and Chunping Tan on April 10, 1973 in
Qiqi-haer, China.
Acknowledgements
Contents
List of Figures
vii
List of Tables
viii
1
Introduction
1
1.1
A motivating problem . . . .
1
1.2
Available approaches . . . .
6
1.2.1
Tree-structured methods for univariate response . . . .
7
1.2.2
Tree-structured methods for multivariate responses . . . .
7
2
The Majority-vote Method
12
2.1
Notations and definitions . . . .
12
2.2
Growing a tree . . . .
16
2.3
Discretizing the response . . . .
20
2.4
Evaluating the relative information loss due to discretization . . . .
22
2.4.1
Multivariate Analysis of Variance (MANOVA) . . . .
23
2.4.2
Example: an analysis on a 576
×
10 data
. . . .
25
2.5
Evaluating the predictivity of a tree: hit rate . . . .
30
2.6
Determining the tree size . . . .
31
3
Simulation Study Design
35
3.1
The underlying assumption of MultiSCAM . . . .
36
3.2
Simulating complete data . . . .
37
3.3
Simulating censored data . . . .
40
3.4
Evaluation rules
. . . .
46
4
Results of Simulation Study
51
4.1
Hit rates comparison: resubstitution estimates
. . . .
51
4.1.1
The design of experiment . . . .
52
4.1.2
The results . . . .
53
4.2
Hit rates comparison: test-sample estimates . . . .
55
4.2.2
The results on complete data . . . .
56
4.2.3
The results on censored data . . . .
59
4.3
Summary of the simulation study . . . .
62
5
Discussions, Conclusions and Future Work
65
5.1
Tree-structured methods: univariate to multivariate . . . .
65
5.2
Conclusions . . . .
67
5.3
Future work . . . .
68
A Summary of Algorithms
70
A.1 Pseudo code of MultiSCAM . . . .
70
A.2 The tree-growing algorithm of the majority-vote method . . . .
71
A.3 The complete-data generating algorithm . . . .
72
A.4 The censored-data generating algorithm . . . .
73
A.5 The tree-size determination algorithm . . . .
73
B Proofs of Propositions
75
B.1 Proof of Proposition 2.1 . . . .
75
B.2 Proof of Proposition 2.2. . . .
77
List of Figures
1.1
A hypothetical tree. The root node (N), which contains 576 observations, is
split on
X
37
; the left subnode of N (N0), which corresponds to
X
37
= 0
, is
further split on
X
16
.
. . . .
6
1.2
Histograms of 10 response variables in a real QSAR data set. The skewness
in response variables is obvious. The pole at 4.0 in the plot for variable
Y
4
,
for example, is a result of single imputation (4.0 at censored values)
. . . .
10
2.1
Node profiles of N00, N01 and N1 in Figure 1.1. The height of each bar
represents the proportion of active compounds.
. . . .
19
2.2
ρ
b
j
(
c
)
as function of
c
for a
576
×
5
response data.
. . . .
21
2.3
Two-way scatter plot of the 10 response variables. There is an apparent
linear relationship between
Y
2
and
Y
3
. The dots on a straight line in other
plots correspond to the imputed values (e.g. 4.0).
. . . .
26
2.4
Part of the tree generated by MultiSCAM on the 576
×
10 data set
. . . .
27
2.5
Resubstitution hit rate (
H
RS
) and cross-validation hit rate (
H
CV
) as
func-tions of
∆
min
. The triangles are values of
H
RS
and the circles are values of
H
CV
.
. . . .
34
3.1
A simple MultiSCAM tree
. . . .
36
3.2
Two-way scatter plots of a simulated censored data
. . . .
45
3.3
True and predicted binary activity
. . . .
49
4.1
H
CV
and
H
T S
as functions of
∆
min
for the 20 complete data sets. Each data
set corresponds to a panel. The solid lines are
H
T S
and the dotted lines are
H
CV
.
. . . .
58
4.2
H
CV
and
H
T S
as functions of
∆
min
for the 20 censored data sets. Each data
set corresponds to a panel. The solid lines are
H
T S
and the dotted lines are
H
CV
.
. . . .
62
List of Tables
1.1
Tree-structured Methods for Univariate Response
. . . .
8
2.1
Node profiles of nodes N00, N01 and N1 in Figure 1.1
. . . .
18
2.2
Mean, median,
c
max
and
ρ
b
(
c
max
)
. The relationship Median
<
Mean
< c
max
verifies Proposition 2.2.
. . . .
22
2.3
The relative information loss due to discretization. The numbers listed are
the ratios of the
F
-values for MultiSCAM and the majority-vote method.
. .
29
3.1
Two artificial data sets: The left table is the “true” data and the right table
is the “observed” data. A star (*) indicates the cell is censored.
. . . .
38
3.2
An example of the real response matrix
Z
and its censoring indicator matrix
D
43
3.3
An example of the simulated complete response matrix
Y
(
com
)
and its
cen-soring indicator matrix
C
. . . .
43
3.4
An example of the simulated response matrix
Y
(
cen
)
. A number in bold
indi-cates that the corresponding cell is censored.
. . . .
44
3.5
2
×
2
table for computing hit rate
. . . .
48
4.1
Resubstitution hit rates of MultiSCAM and the majority-vote method for 20
complete and censored data sets. The 95% CI of
H
(
com,M V
)
−
H
(
com,M SCAM
)
is
(0
.
007
,
0
.
019)
; the 95% CI of
H
(
cen,M V
)
−
H
(
cen,M SCAM
)
is
(0
.
021
,
0
.
033)
.
53
4.2
Test-sample hit rates comparison for the 20 complete data sets. The 95% CI
of
H
T S
(
com,M V
)
−
H
T S
(
com,M SCAM
)
is
(
−
0
.
002
,
0
.
012)
.
. . . .
56
4.3
The effect of
∆
min
on hit rate for the 20 complete data sets. The trees are
built at 5 different values of
∆
min
:
∆
opt
,
∆
opt
±
1
, 3 and 48.
. . . .
57
4.4
Test-sample hit rates comparison for the 20 censored data sets. The 95%
confidence interval of
H
T S
cen,M V
−
H
T S
cen,M SCAM
is
(0
.
018
,
0
.
039)
.
. . . .
60
4.5
The effect of
∆
min
on hit rate for the 20 censored data sets. The trees are
Chapter 1
Introduction
In many applications, multivariate responses are of great interest. When a small
number of covariates are associated with these responses and the responses are continuous
measurements, both parametric and nonparametric models are available in the literature
for modeling the relationship between the responses and the covariates. With a large
num-ber of covariates, many of the parametric models do not work well due to dimensionality
problems. Keefer (2001) presented a recursive partitioning algorithm for analyzing
multi-variate
continuous
responses that is intended to handle a large number of binary covariates.
The objective of our work is to investigate the use of a recursive partitioning algorithm for
multivariate
binary
responses.
1.1
A motivating problem
QSAR study, the biological activity of a number of compounds on a few target proteins is
measured. The measurement of biological activity is quantitative and continuous, usually in
the range of (3
,
10) after log transformation. These activity measurements can be arranged
in a matrix for which each column corresponds to a target protein and each row corresponds
to a compound. Let us call it the
response matrix
and denote it by
Y. The molecule
structure of the compounds is recorded in terms of atom-pair descriptors. The structure
information is placed into a matrix where each row describes a compound and each column
represents a particular atom pair molecular feature. The entry at the
i
th row and the
j
th
column is 1 if the
j
th atom pair is present in the
i
th compound and 0 otherwise. Therefore,
this
descriptor matrix
(denoted by
X) is a binary matrix. Note that computational chemists
also make continuous molecular descriptors available, but we will not discuss the use of
continuous descriptors in this work. To summarize, we have a response matrix
Y
∈
R
n
×
q
and a descriptor matrix
X
∈ {
0
,
1
}
n
×
p
, where
•
n
is the number of compounds involved in the study (typically 500
≤
n
≤
5000),
•
p
is the number of atom-pair descriptors (typically 3000
≤
p
≤
8000),
Let
X
i
and
Y
i
denote the
i
th row of
X
and
Y, respectively (
i
= 1
, . . . , n
). The following is
a typical data layout with
n
= 576,
q
= 10 and
p
= 3270.
Y
=
4
.
0 5
.
6
· · ·
6
.
1
9
.
0 9
.
2
· · ·
7
.
9
..
.
..
.
. ..
..
.
8
.
3 4
.
4
· · ·
4
.
2
n
×
q
,
X
=
1 0
· · ·
1
0 0
· · ·
1
..
.
..
.
. .. ...
0 1
· · ·
0
n
×
p
.
(1.1)
A model for the regression of
Y
i
on
X
i
can be used to describe the relationship between
structure and activity,
E
(Y
i
|
X
i
) =
f
(X
i
)
,
i
= 1
, . . . , n.
Possible candidates of
f
(
·
)
∈
R
q
include
•
linear function of
X
i
:
f
(X
i
) =
β
0
+
β
1
X
i
,
β
0
∈
R
q
,
β
1
∈
R
q
×
p
,
i
= 1
, . . . , n
.
•
piecewise function of
X
i
:
f
(X
i
) =
α
k
if
X
i
∈
G
k
for
k
= 1
, . . . , K
, where
α
k
∈
R
q
and
G
1
, . . . , G
K
⊂ {
0
,
1
}
p
is a partition of the row space of
X,
i
= 1
, . . . , n
.
In QSAR studies, traditional linear models have not received much attention for the
follow-ing reasons:
(i) The curse of dimensionality (Bellman 1961): the number of descriptors
p
is usually
much greater than the number of compounds
n
.
about which interactions to include. For example, in the typical data shown in
Equa-tion (1.1), there are 3270 descriptors and if up to four-way interacEqua-tions are initially
considered, we need 3270
4
interaction terms.
(iii) There may be a mixture of mechanisms where the activity of some compounds follows
one model and the activity of other compounds follows a different model. And in this
case, it is difficult to interpret the activity model in terms of molecular structure with
linear regression models.
Tree-structured approaches have been shown to be successful in overcoming these
diffi-culties. Hawkins et al. (1997) analyzed a large structure-activity data set using recursive
partitioning where the response is univariate (i.e.
q
= 1). Keefer (2001) investigated the use
of recursive partitioning in analyzing structure-activity data with multivariate continuous
responses.
assumptions under which the EM algorithm is carried out are unrealistic in our problem.
In an attempt to lessen the effect of data incompleteness, we proceed as follows.
1. Discretize the original response into binary data (1 for active, 0 for inactive),
2. Use a tree-structured method to analyze the resulting binary data.
1.2
Available approaches
In this section, we give a brief review of the literature on tree-structured methods for
univariate responses, and introduce two currently available approaches to growing trees for
multivariate responses.
Tree-structured methods recursively partition the feature space,
576
328
112
216
248
@
@
@
@
@
@
N
N0
N1
N00
N01
X
37
= 0
X
37
= 1
X
16
= 0
X
16
= 1
Figure 1.1:
A hypothetical tree. The root node (N), which contains 576 observations, is split
on
X
37
; the left subnode of N (N0), which corresponds to
X
37
= 0
, is further split on
X
16
.
i.e. the row space of
X
, into disjoint regions. The resulting partition can be represented by
a
dendrogram
or a
tree
as shown in Figure 1.1. Tree-based methods differ according to the
following factors:
1. Response type (continuous/categorical, univariate/multivariate);
2. Covariate type (continuous/categorical);
3. Splitting criteria;
1.2.1
Tree-structured methods for univariate response
An early tree-structured method, Automatic Interaction Detection (AID)
intro-duced by Morgan and Sonquist (1963), splits a node on a predictor maximizing the
between-node sum-of-squares. Kass (1975) introduced CHi-squared Automatic Interaction
Detec-tion (CHAID) which was later developed into Formal Inference-based Recursive Modelling
(FIRM) by Hawkins (1982). The splitting criteria of both CHAID and FIRM rely on
sta-tistical tests and multiplicity adjustments that can greatly reduce AID’s tendency to overfit
the data. Breiman et al. (1984) introduced Classification And Regression Trees (CART),
that bases splits on a measure of node impurity. CART uses backward selection to
de-termine the size of tree. In CART, a large tree is pruned using cross-validated, minimal
cost-complexity criterion after growing the tree in a forward direction which is usually too
large. The following outline of the pruning method in CART is cited from Segal (1988): (1)
initially growing a very large tree; (2) iteratively pruning this tree all the way back up (to
the root node), thereby creating a nested sequence of trees; (3) selecting the best tree from
this sequence using test-sample or cross-validation estimates of error.
The differences and similarities between FIRM and CART for univariate response
are summarized in Table 1.1.
1.2.2
Tree-structured methods for multivariate responses
Table 1.1:
Tree-structured Methods for Univariate Response
FIRM
CART
Response
continuous/categorical
continuous/categorical
Covariate
continuous/categorical
continuous/categorical
Splitting criteria
t
-test,
F
-test,
χ
2
-test
node impurity
Tree size determination
forward selection, multiplicity
adjustment
backward
selection,
cost-complexity pruning
1972). It is a generalization of Morgan and Sonquist’s original work on AID. Recently,
re-searchers have given attention to tree-structured methods for multivariate responses. Segal
(1992) extended the tree-structured regression paradigm to multiple response settings and
in particular to longitudinal data. Segal’s method is closely related to CART in that it uses
the minimal cost-complexity pruning to determine the tree size. Keefer (2001) discussed
an algorithm called MultiSCAM for recursive partitioning of multivariate continuous
re-sponses in a QSAR context. By assuming multivariate normality of the response variables,
MultiSCAM judges splits by the
p
-value from a two-sample Hotelling’s
T
2
test, and adjusts
the
p
-value for multiple comparisons. The covariate that minimizes the adjusted
p
-value
is chosen to split the data. Basically, MultiSCAM evolved from FIRM. The use of
multi-plicity adjustments in MultiSCAM reduces the tendency of overfitting which was not well
handled in MAID-M. For reference, the pseudo-code of MultiSCAM algorithm is shown in
Appendix A.1. The algorithm starts with the root node which comprises all the
observa-tions. Because the procedures for splitting nodes are identical, we present the algorithm for
a generic node
t
=
{
y
1
, . . . ,
y
n
}
with size
n
in Appendix A.1.
multivari-ate binary responses are of interest. We will call Zhang’s method “the generalized entropy
approach.” A distribution from the exponential family is used to fit the multivariate binary
data in this approach. The generalized entropy criterion is defined as the maximum of the
log-likelihood derived from this distribution. Given this definition of node homogeneity (or
node impurity), the rest of the approach is identical to CART.
Though conceptually simple and statistically meaningful, both MultiSCAM and
the generalized entropy approach have limitations. For MultiSCAM, if the normality
as-sumption is violated (e.g. skewness in responses), it is questionable that the Hotelling’s
T
2
test has the right size. Figure 1.2 shows the histograms of 10 response variables in a real
QSAR data set. Skewness in responses is obvious in this case. Although in the presence
of non-normality, one can take remedial steps to transform the data to normality in the
univariate case, it is sometimes difficult to treat multivariate non-normality by
transforma-tion methods. Mardia (1975) pointed out that the Hotelling’s
T
2
test is more sensitive to
skewness than to kurtosis. In addition, data incompleteness cannot be easily handled in
MultiSCAM. Zhang’s generalized entropy approach involves computing the maximum
like-lihood estimates of the parameters in the multivariate binary distribution model for every
possible split. The procedure for finding the MLE has to be carried out through iteration,
because there is no closed-form solution. When thousands of covariates are involved, the
computational burden is enormous.
3 4 5 6 7
0.0
0.4
Y1
Prob
4 5 6 7 8 9
0.0
0.3
Y2
Prob
3 4 5 6 7 8 9
0.0
0.20
Y3
Prob
4 5 6 7 8
0.0
0.4
Y4
Prob
4 5 6 7 8
0.0
0.3
Y5
Prob
4 5 6 7 8
0.0
0.4
Y6
Prob
4 5 6 7 8
0.0
0.3
Y7
Prob
4.0 4.5 5.0 5.5 6.0 6.5
0.0
1.5
Y8
Prob
4 5 6 7 8 9
0.0
0.3
Y9
Prob
4 5 6 7 8
0.0
0.6
Y10
Prob
Figure 1.2:
Histograms of 10 response variables in a real QSAR data set. The skewness in
response variables is obvious. The pole at 4.0 in the plot for variable
Y
4
, for example, is a
result of single imputation (4.0 at censored values)
Segal’s approach to the multivariate situation partly results from the lack of a natural way
to order multivariate data.
algorithm for analyzing multivariate binary responses that are obtained by discretizing
the continuous responses. Here, it is necessary to emphasize that discretizing does not
completely solve the incomplete data problem. Rather, it alleviates the problem as discussed
in Section 1.1. We expect that the price we pay on information loss (due to discretization)
will be rewarded by a more valid and robust inference.
Chapter 2
The Majority-vote Method
2.1
Notations and definitions
We denote the covariate matrix by
X
∈ {
0
,
1
}
n
×
p
where
p
covariate variables
(columns) are measured on
n
individuals (rows). The response matrix is denoted by
Y
∈
{
0
,
1
}
n
×
q
where
q
response variables (columns) are observed on
n
individuals (rows). For
the
i
th individual (
i
= 1
, . . . , n
), the covariates and responses are
X
i
= (
X
i
1
, . . . , X
ip
) and
Y
i
= (
Y
i
1
, . . . , Y
iq
), respectively. A typical set of data looks like
Y
11
Y
12
· · ·
Y
1
q
Y
21
Y
22
· · ·
Y
2
q
..
.
..
.
. ..
..
.
Y
n
1
Y
n
2
· · ·
Y
nq
,
X
11
X
12
· · ·
X
1
p
X
21
X
22
· · ·
X
2
p
..
.
..
.
. ..
..
.
X
n
1
X
n
2
· · ·
X
np
,
where the
X
ij
and
Y
ik
are either 0 or 1. For convenience, the column vectors of
X
and
Y
8000,
q
is between 5 and 15.
Definition 2.1
A node t is a subset of the data. A node that comprises all the response
data is called the root node.
root node =
{
Y
i
:
i
= 1
, . . . , n
}
Definition 2.2
The size of node t,
n
(
t
)
, is the number of response vectors in node
t
.
Definition 2.3
The profile of node
t
is the mean response vector of node
t
,
(
p
1
(
t
)
, p
2
(
t
)
, . . . , p
q
(
t
))
,
where
p
k
(
t
) =
1
n
(
t
)
X
Y
i∈
node
t
Y
ik
.
Definition 2.4
Majority-vote prediction: Each response vector in a node
t
is predicted by
a binary
q
-vector
Y(
ˆ
t
) =
Y
ˆ
1
(
t
)
, . . . ,
Y
ˆ
q
(
t
)
, where
ˆ
Y
k
(
t
) =
I
(
p
k
(
t
)
≥
1
/
2)
is the mode of the
k
th response variable in node
t
.
Note.
Y(
ˆ
t
) in Definition 2.4 is indeed a multivariate mode defined by univariate modes.
Example 1.
Suppose node
t
comprises five observations of
q
=4 response variables:
1 0 0 1
0 1 0 0
0 1 1 1
1 1 0 1
0 1 0 0
then the size of
t
is
n
(
t
) = 5 and ˆ
Y(
t
) = (0
,
1
,
0
,
1).
2
In the univariate response case (i.e.
q
= 1, each response is a scalar), CART uses
the following measures of node impurity.
1. Misclassification Rate:
{
n
(
t
)
}
−
1
P
Y
i1∈
t
I
Y
i
1
6
= ˆ
Y
1
(
t
)
,
2. Gini Index: 2
p
1
(
t
)(1
−
p
1
(
t
)),
3. Cross Entropy:
−
p
1
(
t
) log
p
1
(
t
)
−
(1
−
p
1
(
t
)) log(1
−
p
1
(
t
)),
Typically, either Gini index or cross entropy is used as the basis of the splitting
crite-ria, while misclassification rate is used for evaluating the quality of a tree. Theoretically,
distribution-based generalizations of these measures to the case of multivariate responses
require modeling the multivariate binary vector. Three approaches for directly modeling
multivariate binary distributions are discussed in Cox (1972):
(i)
Independent variables.
The component binary responses are treated as independent
variables.
(ii)
Arbitrary multinomial distributions.
The sample is treated as a multinomial one with
2
q
−
1 independent parameters corresponding to the 2
q
distinct combinations.
(iii)
Logistic models.
This kind of models is intermediate between (i) and (ii), allowing
the presence of special kinds of dependence. Let
Y
i
be the
i
th response variable and
Z
i
= 2
Y
i
−
1, so that the
Z
’s take values
±
1 (
i
= 1
, . . . , q
). Suppose that
log
P
(
Z
1
=
z
1
, . . . , Z
q
=
z
q
) =
X
i
α
i
z
i
+
X
i<j
where Λ is a normalizing constant. If only the first degree terms are included, we have
the independence model (i), whereas if all terms up to
z
1
· · ·
z
q
are taken we have the
general multinomial model (ii).
Zhang (1998) generalized the cross entropy criterion of CART using a special form of
Equa-tion (2.1):
log
P
(
Z
1
=
z
1
, . . . , Z
q
=
z
q
) =
X
i
α
i
z
i
+
β
X
i<j
z
i
z
j
−
Λ
,
(2.2)
where only one parameter
β
is used for two-way cross products. In this case, the procedure
for estimating the parameters has to be carried out through iteration for each possible split.
Thus, when thousands of covariates are involved, the computation task is formidable. The
independence model (i) should not be appropriate in our case, because correlations among
response variables are expected due to structural similarities among the proteins. The
gen-eral multinomial model (ii) is only applicable when
n
, the number of observations, is fairly
large and
q
is fairly small, so that each cell contains a reasonable number of observations.
In a typical data in our study, e.g., the data given by Equation (1.1),
n
= 576,
q
= 10, and
2
q
= 1024. Thus many cells contain no observations.
Definition 2.5
The Generalized Misclassification Rate of node
t
is defined as
R
(
t
) =
1
qn
(
t
)
q
X
k
=1
X
Y
i∈
t
I
Y
ik
6
= ˆ
Y
k
(
t
)
=
1
q
q
X
k
=1
R
k
(
t
)
(2.3)
where
R
k
(
t
) =
1
n
(
t
)
X
Y
i∈
t
I
Y
ik
6
=
Y
b
k
(
t
)
Remarks.
R
k
(
t
) = min (
p
k
(
t
)
,
1
−
p
k
(
t
)). If
p
k
(
t
)
≥
1
/
2, then
Y
b
k
(
t
) = 1 and
Y
ik
6
=
Y
b
k
(
t
) is
equivalent to
Y
ik
= 0, hence
R
k
(
t
) = 1
−
p
k
(
t
). Otherwise, if
p
k
(
t
)
<
1
/
2, then
Y
b
k
(
t
) = 0
and
Y
ik
6
=
Y
b
k
(
t
) is equivalent to
Y
ik
= 1, hence
R
k
(
t
) =
p
k
(
t
). Therefore,
R
k
(
t
) = min (
p
k
(
t
)
,
1
−
p
k
(
t
))
(2.4)
Example 2.
For the node given in Example 1,
R
(
t
) =
(4)(5)
1
(2 + 1 + 1 + 2) = 0
.
3.
2
The generalized misclassification rate is just the average of (univariate)
misclassifi-cation rates of each
Y
(
k
)
under the majority-vote criterion. Similarly, a simple generalization
of Gini index is given by
Definition 2.6
The Generalized Gini Index of node
t
:
G
(
t
) =
1
q
q
X
k
=1
2
p
k
(
t
) (1
−
p
k
(
t
))
,
where
p
k
(
t
) =
{
n
(
t
)
}
−
1
P
Y
i∈
t
Y
ik
.
2.2
Growing a tree
measure. Because the split procedures for every node are identical, we present the algorithm
for a generic node
t
. Recall that we restrict ourselves to the case in which all covariate
variables are binary.
Because the covariates are binary, each covariate corresponds to a unique split of
node
t
, according to its value: a split of node
t
on variable
X
(
j
)
results in two subnodes of
t
, the left subnode
t
L
(
j
) and the right subnode
t
R
(
j
), where
t
L
(
j
) =
{
Y
i
∈
t
:
X
ij
= 0
}
and
t
R
(
j
) =
{
Y
i
∈
t
:
X
ij
= 1
}
.
Proposition 2.1
n
(
t
)
R
(
t
)
≥
n
(
t
L
(
j
))
R
(
t
L
(
j
)) +
n
(
t
R
(
j
))
R
(
t
R
(
j
))
,
∀
j
∈ {
1
, . . . , p
}
.
The proof of Proposition 2.1 appears in Appendix B.1. Note that
n
(
t
)
R
(
t
) measures the
total node impurity of node
t
, and the right-hand side is the total node impurity of two
re-sulting subnodes of
t
. Proposition 2.1 indicates that the total node impurity never increases
after a split. The following example illustrates this idea.
Example 3.
Refer to Example 1.
Suppose we want to split the node on covariate
X
(1)
= (0
,
1
,
0
,
0
,
1)
T
, then the resulting subnodes will be
t
L
(1) =
{
Y
1
,
Y
3
,
Y
4
}
and
t
R
(1) =
{
Y
2
,
Y
5
}
, i.e.,
t
L
(1) =
1 0 0 1
0 1 1 1
1 1 0 1
,
t
R
(1) =
0 1 0 0
0 1 0 0
.
We have
R
(
t
L
(1)) =
(4)(3)
1
(1+1+1+0) = 0
.
25 and
R
(
t
R
(1)) =
(4)(2)
1
(0+0+0+0) = 0. Note
that
n
(
t
)
R
(
t
) = 5
×
0
.
3 = 1
.
5,
n
(
t
L
(1))
R
(
t
L
(1)) = 3
×
0
.
25 = 0
.
75 and
n
(
t
R
(1))
R
(
t
R
(1)) =
2
×
0 = 0. Here,
That is, every possible split results in “purer”
1
subnodes.
2
The “best” split of node
t
is defined as the variable
x
(
j
0)
for which the decrease in
node impurity is maximized, or,
j
0
= arg max
j
{
n
(
t
)
R
(
t
)
−
[
n
(
t
L
(
j
))
R
(
t
L
(
j
)) +
n
(
t
R
(
j
))
R
(
t
R
(
j
))]
}
.
Let us denote the maximum decrease by ∆
R
(
t
). i.e.
∆
R
(
t
) =
n
(
t
)
R
(
t
)
−
[
n
(
t
L
(
j
0
))
R
(
t
L
(
j
0
)) +
n
(
t
R
(
j
0
))
R
(
t
R
(
j
0
))]
(2.5)
Having found the best split
j
0
for the node
t
, we repeat this procedure for each resulting
subnode
t
L
and
t
R
until some prespecified threshold is reached (e.g. the size of a node
is less than 5, or ∆
R
(
t
)
<
10). By this algorithm, recursively partitioning the root node
will eventually result in a tree
T
, the size of which (i.e. number of terminal nodes) can
be controlled by adjusting the threshold values of node size and/or ∆
R
(
t
). The tree built
can then be used to interpret the relationship between
Y
and
X. We illustrate this in the
following example.
Example 4.
Consider the tree shown in Figure 1.1, where we have
q
= 5 response variables.
Assume the profiles of node N00, N01 and N1 are as shown in Table 2.1. Note that
p
k
(
t
)
Table 2.1:
Node profiles of nodes N00, N01 and N1 in Figure 1.1
Node
t
p
1
(
t
)
p
2
(
t
)
p
3
(
t
)
p
4
(
t
)
p
5
(
t
)
N00
0.1
0.8
0.2
0.2
0.4
N01
0.3
0.4
0.9
0.2
0.2
N1
0.9
0.2
0.3
0.1
0.1
is the proportion of 1s in node
t
for the
k
th response variable
Y
k
. Recall that the value 1
corresponds to active response. For example, the profile of node N1 is (0
.
9
,
0
.
2
,
0
.
3
,
0
.
1
,
0
.
1),
i.e., of all the 248 compounds in node N1 (refer to Figure 1.1), 248
×
0
.
9
≈
223 are active
against the first protein (P1), 248
×
0
.
2
≈
49 the second, and so on. Thus, we could say
that node N1 is highly selective against P1, i.e., most compounds in node N1 are active on
P1, but not active on other proteins. On the other hand, all the compounds in node N1
have a common feature:
X
37
= 1. One can interpret the selectivity of node N1 as a result
of having chemical feature
X
37
. Other nodes can be explained in the same way. The node
profiles in Table 2.1 are visualized in Figure 2.1.
2
Y1
Y2
Y3
Y4
Y5
N00
0.0
0.4
0.8
profile
Y1
Y2
Y3
Y4
Y5
N01
0.4
0.8
profile
Y1
Y2
Y3
Y4
Y5
N1
0.0
0.4
0.8
profile
Figure 2.1:
Node profiles of N00, N01 and N1 in Figure 1.1. The height of each bar
represents the proportion of active compounds.
Appendix A.2.
2.3
Discretizing the response
This section discusses how to choose the cutoff for discretizing the response. In
general, discretizing the response will reduce information. Thus, the quality of the resulting
binary data will affect the analysis. Hence, choosing the “best” cutoff is important. So
that the discretized data preserves as much information as possible, we choose the cutoff
for which the correlation between the continuous response and the corresponding binary
response is maximized.
Let us first consider discretizing a continuous random variable
Y
. Let
Z
c
=
I
(
Y
≥
c
)
denote the discretized binary random variable, where
c
is the cutoff point. The correlation
coefficient between
Y
and
Z
c
ρ
(
c
) = corr(
Y, I
(
Y
≥
c
)) =
E
p
((
Y
−
E
(
Y
))
Z
c
)
Var(
Y
)Var(
Z
c
)
(2.6)
is a measure of the information of
Y
preserved in
Z
c
. Defince
c
max
as the value of
c
that
maximizes
ρ
(
c
), i.e.
c
max
= arg max
c
ρ
(
c
) = arg max
c
corr(
Y, I
(
Y
≥
c
))
.
Although it is generally difficult to find a closed-form solution of
c
max
for a given
Y
, under
fairly general conditions,
c
max
,
E
(
Y
) and Median(
Y
) satisfy the following relationship.
Proposition 2.2
If
ρ
(
c
)
has a unique local maximum,i.e.,
ρ
0
(
c
) = 0
has a unique solution,
1. if
E
(
Y
)
> M edian
(
Y
)
then
c
max
> E
(
Y
)
,
2. if
E
(
Y
)
< M edian
(
Y
)
then
c
max
< E
(
Y
)
,
3. if
E
(
Y
) =
M edian
(
Y
)
then
c
max
=
E
(
Y
)
.
Hence,
c
max
is related to the skewness in
Y
. The proof appears in Appendix B.2.
c
corr(Y1,I(Y1>c))
4.5 5.0 5.5 6.0 6.5
0.5
0.6
0.7
0.8
0.9
1.0
median mean
c
corr(Y2,I(Y2>c))
4.5 5.0 5.5 6.0 6.5
0.5
0.6
0.7
0.8
0.9
1.0
median mean
c
corr(Y3,I(Y3>c))
4.5 5.0 5.5 6.0 6.5
0.5
0.6
0.7
0.8
0.9
1.0
median mean
c
corr(Y4,I(Y4>c))
4.5 5.0 5.5 6.0 6.5
0.5
0.6
0.7
0.8
0.9
1.0
median mean
c
corr(Y5,I(Y5>c))
4.5 5.0 5.5 6.0 6.5
0.5
0.6
0.7
0.8
0.9
1.0
median mean