Assistance Bioassay Selection - Machine learning methods for better drug prioritization

The ideal assistance bioassays for the target bioassay Bi are expected to provide

auxiliary information, carried out by consequential assistance compounds, with which Bi’s ranking model RMican be improved. A key question here is what such auxiliary

information could be in the context of ranking. In active learning for classification, auxiliary information could be additional strong positive/negative signals that help bias the classification boundary toward the right direction. In the context of regression, auxiliary information could be additional data samples that help better reveal the underlying data distribution. Unfortunately, such options from classification and regression do not directly apply for ranking as ranking focuses on the ordinal relations across multiple instances. Thus, we expect that auxiliary information from assistance bioassays could be the information that helps strengthen, remedy or reconstruct the desired reference/ordinal structures among the compounds in the target bioassay. Furthermore, assistance bioassays should be the ones that sufficiently exhibit such information.

In order to identify sensible assistance bioassays, we need quantitative measure- ments to evaluate how much auxiliary information each candidate bioassay carries. However, it is non-trivial to quantify such information content and volume. Instead, we surrogate them by the similarity between the target bioassay and a candidate assistance bioassay in terms of their ordinal structures, under the hypothesis that if two bioassays are significantly similar in their ordinal structures, one of them carries auxiliary information for the other. In specific, we develop the similarities by comparing the ordinal structures from the following two aspects:

• How the target bioassay Bi’s model RMiperforms on the candidate assistance bioas-

say. This method represents an indirect comparison of the ordinal structures; and • How the target bioassay and the candidate assistance bioassay are similar in their compounds and compound rankings. This is a direct comparison of the ordinal structures.

These two similarities lead to the following two assistance bioassay selection schemes: cross-ranking based assistance bioassay selection, and profiling based assistance bioassay selection, respectively. From all the candidate bioassays, we select the assistance bioassays into a set denoted as B_i+, where all the assistance bioassays have the re- spective bioassay similarities that fall in 98 percentile of all the bioassay similarities.

2.4.1 Cross-Ranking based Bioassay Similarities

The first bioassay similarity measurement is inspired by our previous work that has been applied for target fishing [52]. The idea is, for the target bioassay Bi, if its ranking

model RMi performs well on another bioassay Bj, then Biand Bj are similar in terms

of their ranking structures. The underlying assumption is that model RMi captures

and models the signals from Bi’s compound ranking, and the good performance of

RMion Bj indicates that such signals align well with those from Bj’s ranking. Under

this assumption, the problem further boils down to measuring the performance of RMi on Bj. Such cross-ranking based assistance bioassay selection scheme is denoted

as bSimx.

To measure the performance of RMi on Bj, we devise the follow two approaches:

1). the first approach, denoted as bSimci

x , relies on a standard ranking evaluation

metric; and 2). the second approach, denoted as bSimal

x , utilizes sequence-alignment

based ranking comparison. Note that in bSimxwe only use RMion Bjin order to select

assistance bioassays to improve RMi. We don’t use RMj on Bi for RMi improvement

purposes because in addition to RMi, it requires the availability of Bj’s model RMj,

Concordance Indexing for bSimx (bSimcix )

We first use concordance index (CI; will be discussed later in Section 2.7.2) to evaluate the ranking performance of RMi on Bj. In this case, RMi ranks Bj into a

ranking list ˜ri→j, and CI is then calculated on ˜ri→j with respect to Bj’s true ranking

list rj. The higher the CI is, the better the ranking model RMi can predict the

ranking relations in Bj, and thus the more similar Bi and Bj are. Please note that

the similarities calculated from bSimci

x are not necessarily symmetric because the CI

calculated from ˜ri→j (i.e., ranking that RMi produces for Bj) and ri is not necessarily

the same as the CI calculated from ˜rj→i (i.e., ranking that RMj produces for Bi) and

rj.

Ranking Alignment for bSimx (bSimalx )

The concordance index CI measures the entirety of the ranking structures. How- ever, it is possible that only a certain portion of the ranking structures in Bj will

help, while CI cannot indicate such scenarios. Thus, we develop an alignment based ranking performance measurement bSimal

x (details on ranking list alignment will be

discussed later in Section 2.7.3). The key idea of bSimal

x is to identify locally con-

served ranking structures among ˜ri→j and rj. If the alignment reveals strong block

structures between ˜ri→j and rj, it indicates that RMi is able to reproduce a certain

chunk of orderings in rj, which would be considered for auxiliary information.

2.4.2 Profiling based Bioassay Similarity

The second bioassay similarity measurement is based on the comparison of compound profiles of two bioassays without modeling any of them. If two bioassays have similar rankings over similar compounds, we consider them as similar and hypothesize that they carry useful information that can be utilized to assist each other. Under this hypothesis, the problem can be casted to that, for bioassay Bi and Bj, we compare

the two ranking lists ri and rj. We develop the following two approaches for ranking

list comparison: 1). the first approach, denoted as bSimal

p , compares two ranking

lists riand rj using alignment; and 2). the second approach, denoted as bSimcsp , com-

pares two sets of compounds Ciand Cjregardless of ranking structures. The approach

bSimcs_p is for approach comparison purposes and to make the study complete.

Ranking Alignment for bSimp (bSimalp )

The key idea in profiling-based ranking alignment approach bSimal

p is very similar

to that of bSimal

x , that is, to measure how similar two rankings are. In specific, we

look at to what extent similar compounds are ranked in similar orders. However, in bSimal_p , instead of aligning ˜ri→j and rj as in bSimalx , we align ri and rj and use the

alignment to measure the similarity between Bi and Bj.

Compound Similarities for bSimp (bSimcsp )

In bSimcs

p , we compare Bi and Bj by looking at how similar their compounds are,

and thus, the similarity between Biand Bjis calculated as the average compound sim-

ilarities (Compound similarity will be discussed later in Section 2.7.1). This approach ignores the ranking ordering among the compounds.

In document Machine learning methods for better drug prioritization (Page 38-41)