CHAPTER 3. DEVELOPMENT OF ALGORITHM ECONOMIC RATIO (ER) AS

3.4. Expanding ER as a Target Function

3.4.2. Tree construction algorithm

A tree construction algorithm selects, for each branch node, the descriptor that best meets a splitting criterion such as minimal ERc (ER as branch target function), maximal CCRc, or maximal decrease of Gini impurity. Choosing a binary (0 or 1) descriptor partitions the compounds at the node into four counts: PR and NR are, respectively, the numbers of positive (active) and negative (inactive) compounds with the fragment, which are placed on the right side of the branch; PL and NL are, respectively, the numbers of positive (active) and negative (inactive) compounds without the fragment, which are placed on the left side of the branch.
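The four counts can be tallied directly from one descriptor column and the activity labels. The following is a minimal sketch, not the implementation used in this work; the names `BranchCounts` and `branch_counts` and the 0/1 list encoding are assumptions made for illustration.

```python
from typing import NamedTuple, Sequence


class BranchCounts(NamedTuple):
    """Counts produced by splitting a node on one binary descriptor."""
    PR: int  # actives with the fragment (right side)
    NR: int  # inactives with the fragment (right side)
    PL: int  # actives without the fragment (left side)
    NL: int  # inactives without the fragment (left side)


def branch_counts(has_fragment: Sequence[int],
                  is_active: Sequence[int]) -> BranchCounts:
    """Tally PR, NR, PL, NL for one descriptor (hypothetical helper)."""
    if len(has_fragment) != len(is_active):
        raise ValueError("descriptor column and labels differ in length")
    pr = nr = pl = nl = 0
    for present, active in zip(has_fragment, is_active):
        if present and active:
            pr += 1
        elif present:
            nr += 1
        elif active:
            pl += 1
        else:
            nl += 1
    return BranchCounts(pr, nr, pl, nl)


# Five compounds: fragment presence and activity, both coded 0/1.
counts = branch_counts([1, 1, 0, 0, 1], [1, 0, 1, 0, 1])
print(counts)
```

Every splitting criterion discussed in this section can be evaluated from these four numbers alone.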

ERc could be optimized (minimized) directly by equation (3.2) using only the rate of compounds with the fragment that are actually hits (PR) and the rate of compounds with the fragment that are actually not hits (NR). However, we have found that such simplistic use of ERc can create a highly skewed tree like the one shown in Figure 3.1, based on the PGP dataset. Descriptor instances of this dataset were only 7% of all possible instances, yielding a sparse QSAR matrix. To a depth of four tests, a total of 63 compounds were allocated in the active leaves, and all 63 actually were hits. Consequently, ER = 1.00 (perfect) with coverage (the ratio of total predicted positives to modeling compounds) = 0.17. This tree predicted a hit by a sequence of tests for absence or presence of four fragment descriptors.
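The reported figures can be checked with a short calculation. The sketch below assumes that ER is the ratio of predicted positives to the true hits among them, a reading inferred from the numbers reported here because equation (3.2) is not reproduced in this section. The function names are hypothetical, and the total of 370 modeling compounds is an assumed value chosen only to be consistent with the reported coverage of 0.17.

```python
def economic_ratio(predicted_positives: int, true_hits: int) -> float:
    """ER as predicted positives per true hit (assumed reading of eq. 3.2)."""
    if true_hits == 0:
        return float("inf")  # no hits found: ER is unbounded
    return predicted_positives / true_hits


def coverage(predicted_positives: int, modeling_compounds: int) -> float:
    """Fraction of modeling compounds predicted to be positive."""
    return predicted_positives / modeling_compounds


# Assumed dataset size, not stated in the text.
ASSUMED_MODELING_COMPOUNDS = 370

er = economic_ratio(63, 63)
cov = coverage(63, ASSUMED_MODELING_COMPOUNDS)
print(f"ER = {er:.2f}, coverage = {cov:.2f}")
```

Under these assumptions the skewed tree is perfect in ER but reaches only about one sixth of the modeling compounds, which is the trade-off the weighted criterion is meant to address.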

To achieve a more balanced tree with optimization of ERc, we instead can select the descriptor that maximizes the following function ∆ER over all branch choices.

$$\Delta ER_c = w \times \left| \frac{1}{ER_R} - \frac{1}{ER_L} \right| \qquad (3.3)$$

Here w is a weight function defined at the branch as the smaller of two numbers, PR+NR and PL+NL, i.e., w = min(PR+NR, PL+NL). ER_R is the ERc of the right child node, and ER_L is that of the left child node. The employment of w here biases descriptor choice toward those descriptors that allow many active cases and also have low ERc, not those that simply minimize ERc. We designate the target function defined by equation (3.3) as WERc (Weighted ERc), which can generate a nearly balanced tree using the same PGP data. In Figure 3.2, to a depth of four tests, a total of 122 compounds were allocated in the active leaves, of which 103 actually were hits. Consequently, ER was 1.18 with coverage of 0.33. This tree (Figure 3.2) was more balanced than the tree built with ERc (Figure 3.1).
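The effect of the weight w can be illustrated with two hypothetical descriptors. The sketch below rests on two assumptions: that equation (3.3) is the weighted absolute difference of the reciprocal child ERc values, and that 1/ERc of a child equals its fraction of actives, P/(P+N). The function names and the example counts are invented for illustration and are not taken from the PGP dataset.

```python
from typing import Mapping, Tuple

Counts = Tuple[int, int, int, int]  # (PR, NR, PL, NL)


def inverse_er(p: int, n: int) -> float:
    """1/ERc of a child node, assumed to be its fraction of actives."""
    total = p + n
    return p / total if total else 0.0


def delta_er(pr: int, nr: int, pl: int, nl: int) -> float:
    """Weighted ER target (assumed form of eq. 3.3)."""
    w = min(pr + nr, pl + nl)  # size of the smaller child
    return w * abs(inverse_er(pr, nr) - inverse_er(pl, nl))


def best_descriptor(candidates: Mapping[str, Counts]) -> str:
    """Return the descriptor name with the largest weighted target."""
    return max(candidates, key=lambda name: delta_er(*candidates[name]))


# Hypothetical counts: A isolates 5 pure hits; B gives a larger,
# less pure right child.
candidates = {
    "A": (5, 0, 45, 50),
    "B": (30, 10, 20, 40),
}
for name, c in candidates.items():
    print(name, round(delta_er(*c), 2))
print("selected:", best_descriptor(candidates))
```

Descriptor A has the lower right-child ERc (all five compounds are hits) but a small w, so the weighted target prefers B, which is the behavior described above for WERc.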

The tree grows from the root node by repeatedly applying the following steps to each node (based on the CART algorithm[29]). A descriptor with small w (such as five or less, as in this paper) is considered to be unsatisfactory. Also unsatisfactory is any descriptor with an undesirable target function value (such as ERc of a leaf > 1.7, as in this paper). A flowchart for WERc application is shown in Figure 3.3.

The Correct Classification Rate (CCR) is simply the average of sensitivity and specificity. Thus:

$$CCR_c = \frac{1}{2} \times \left( \frac{P_R}{P_R + P_L} + \frac{N_L}{N_R + N_L} \right) \qquad (3.4)$$

Again addressing the PGP dataset, at each branch a descriptor can be chosen to maximize the CCRc function, resulting in the tree in Figure 3.4. To a depth of four tests, a total of 107 compounds were predicted to be hits, of which 88 actually were hits. Consequently, ER = 1.22 with coverage = 0.29.

Gini impurity is a measure of the cost of misclassifying a randomly chosen compound from the set. It is used as the default target function in many popular decision tree methods such as Random Forest[30]. A detailed explanation of Gini impurity can be found in Breiman, et al.[29], where Gini impurity is calculated by:

$$i(t) = \sum_{i,j} C(i|j)\, p(i|t)\, p(j|t) \qquad (3.5)$$

The Gini splitting criterion is choice of a descriptor with maximum decrease of impurity:

$$\Delta i(s,t) = i(t) - P_L\, i(t_L) - P_R\, i(t_R) \qquad (3.6)$$

where in equations (3.5) and (3.6), i(t) is the impurity measure of node t, s is the candidate split (here, a descriptor), and C(i|j) is the cost of misclassifying a compound in class j as a class i compound. Obviously, C(i|i) = 0. p(i|t) is the probability of classifying a compound to class i given that it arrives at node t; PL and PR are the probabilities of sending a compound without the descriptor to the left child node tL and with the descriptor to the right child node tR, respectively (here PL and PR denote probabilities, not the compound counts used in equation (3.4)).
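The two alternative criteria, equations (3.4) and (3.6), can be evaluated from the same four branch counts used for ERc. The following is a minimal sketch under stated assumptions: two classes, unit misclassification costs (C(i|j) = 1 for i ≠ j), and probabilities estimated as simple proportions of compounds at the node. The function names and example counts are hypothetical.

```python
def ccr(pr: int, nr: int, pl: int, nl: int) -> float:
    """Average of sensitivity and specificity for one split (eq. 3.4)."""
    sensitivity = pr / (pr + pl) if (pr + pl) else 0.0
    specificity = nl / (nr + nl) if (nr + nl) else 0.0
    return 0.5 * (sensitivity + specificity)


def gini(p: int, n: int) -> float:
    """Two-class Gini impurity of a node with unit costs (eq. 3.5)."""
    total = p + n
    if total == 0:
        return 0.0
    frac_active = p / total
    # Sum over i != j of p(i|t) p(j|t) has two equal terms.
    return 2.0 * frac_active * (1.0 - frac_active)


def gini_decrease(pr: int, nr: int, pl: int, nl: int) -> float:
    """Decrease of impurity produced by one split (eq. 3.6)."""
    total = pr + nr + pl + nl
    if total == 0:
        raise ValueError("cannot split an empty node")
    prob_left = (pl + nl) / total    # P_L as a probability
    prob_right = (pr + nr) / total   # P_R as a probability
    parent = gini(pr + pl, nr + nl)
    return parent - prob_left * gini(pl, nl) - prob_right * gini(pr, nr)


# Hypothetical split: 40 compounds with the fragment, 60 without.
example = (30, 10, 20, 40)  # (PR, NR, PL, NL)
print(f"CCR = {ccr(*example):.3f}")
print(f"Gini decrease = {gini_decrease(*example):.4f}")
```

A tree builder would compute one of these values for every candidate descriptor at a node and keep the descriptor with the maximum.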

Our goal in tree construction can be thought of as a tree without excessive depth and with several leaves that are entirely or mostly hits, all with acceptable coverage of compounds. For example, designers at a certain stage of drug discovery might require at least 10% of compounds to be hits in addition to low ERv values.
