A Copula Based Approach for Design of Multivariate Random Forests for Drug Sensitivity Prediction


DigitalCommons@University of Nebraska - Lincoln

Faculty Publications, Department of Statistics Statistics, Department of

2015


Saad Haider Raziur Rahman Souparno Ghosh Ranadip Pal


This Article is brought to you for free and open access by the Statistics, Department of at DigitalCommons@University of Nebraska - Lincoln. It has been accepted for inclusion in Faculty Publications, Department of Statistics by an authorized administrator of DigitalCommons@University of Nebraska - Lincoln.


Saad Haider1, Raziur Rahman1, Souparno Ghosh2, Ranadip Pal1*

1 Department of Electrical and Computer Engineering, Texas Tech University, Lubbock, Texas, United States of America, 2 Department of Mathematics and Statistics, Texas Tech University, Lubbock, Texas, United States of America

*[email protected]

Abstract

Modeling sensitivity to drugs based on genetic characterizations is a significant challenge in the area of systems medicine. Ensemble based approaches such as Random Forests have been shown to perform well in both individual sensitivity prediction studies and team science based prediction challenges. However, Random Forests generate a deterministic predictive model for each drug based on the genetic characterization of the cell lines and ignore the relationship between different drug sensitivities during model generation. This application motivates the need for generation of multivariate ensemble learning techniques that can increase prediction accuracy and improve variable importance ranking by incorporating the relationships between different output responses. In this article, we propose a novel cost criterion that captures the dissimilarity in the output response structure between the training data and node samples as the difference in the two empirical copulas. We illustrate that copulas are suitable for capturing the multivariate structure of output responses independent of the marginal distributions and that the copula based multivariate random forest framework can provide higher accuracy prediction and improved variable selection. The proposed framework has been validated on the Genomics of Drug Sensitivity for Cancer and Cancer Cell Line Encyclopedia databases.

OPEN ACCESS

Citation: Haider S, Rahman R, Ghosh S, Pal R (2015) A Copula Based Approach for Design of Multivariate Random Forests for Drug Sensitivity Prediction. PLoS ONE 10(12): e0144490. doi:10.1371/journal.pone.0144490

Editor: Attila Gursoy, Koc University, TURKEY

Received: March 13, 2015

Accepted: November 19, 2015

Published: December 10, 2015

Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability Statement: The data used in the manuscript is available from http://www.cancerrxgene.org/. The Matlab implementation of the algorithms can be downloaded from https://github.com/sahaider/copulamrf.

Funding: This work was supported by National Science Foundation Grant CCF 0953366. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing Interests: The authors have declared that no competing interests exist.

Introduction

An important goal of systems medicine is to generate genomics informed personalized therapeutic regimes with higher efficacy. The ability of inferred models to accurately predict sensitivity of an individual tumor to a drug or drug combination can assist in designing personalized cancer therapy treatments with expected effectiveness significantly higher than current standard of care approaches. A variety of techniques have been proposed for drug sensitivity prediction based on genetic characterizations. A common approach is to consider a training set of cell lines with experimentally measured genomic characterizations (RNA expression, protein expression, methylation, SNPs etc.) and response to different drugs, and design supervised predictive models for each individual drug based on one or more genomic characterizations. For instance, statistical tests have been used to show that genetic mutations can be predictive of the drug sensitivity in non-small cell lung cancers [1]. In [2], gene expression profiles have been used to predict the binarized efficacy of a drug over a cell line with an accuracy ranging from 64% to 92%. In [3], a co-expression extrapolation (COXEN) approach was used to predict the drug sensitivity for samples outside the training set with an accuracy of around 82% and 75% in predicting the binarized sensitivity of bladder and breast cancer cell lines respectively. Tumor sensitivity prediction has also been considered as (a) a drug-induced topology alteration [4] using phosphoproteomic signals and prior biological knowledge of generic pathways and (b) a molecular tumor profile based prediction [1,5]. Drug sensitivity prediction using an elastic net regression analysis [6] over more than 100,000 genomic features (RNA expression, mutational status of specific genes and SNPs) was considered in [7]. The correlation coefficients between the predicted and actual sensitivity over 450 cell lines using 10-fold cross validation ranged from 0.08 to 0.76 for different targeted drugs. [8] used a Random Forest (RF) based approach to tumor prediction in the NCI 60 cell lines with performance exceeding multiple existing approaches.

The motivation towards Multivariate Random Forests originated from our participation in a recent community based effort organized by the Dialogue on Reverse Engineering Assessment and Methods (DREAM) project [9] and the National Cancer Institute (NCI) that explored multiple different drug sensitivity prediction algorithms applied to a common dataset. More than 40 different approaches were applied and our submission based on Random Forests (RF), which considered the generation of individual models for each drug, was a top performer in the challenge [10]. However, sets of drugs can have common targets or paths of action resulting in correlated responses in their sensitivities, which can possibly be utilized to improve the accuracy of the supervised predictive model. Note that the best performing approach in this challenge considered the relationships in the output responses in the form of Bayesian multitask learning [9] and the details of this multi-output regression approach are available in [11]. A multi-drug model has also been pursued in [12], which used multi-output regression with neural networks on the Genomics of Drug Sensitivity for Cancer (GDSC) dataset. Since our top performing RF model ignored the multi-drug response dependencies, we investigated the extension of the RF framework to Multivariate Random Forests (MRF) that incorporate the relationships between the output sensitivities. The objective of the MRF framework is to generate predictions that minimize error and have a multivariate structure similar to the relationships in the original training output responses. To generate individual multivariate regression trees for the construction of MRF, we altered the node cost function to consist of the weighted sum of the squared differences from the mean (similar to the univariate regression tree cost function) and a penalty term to capture the difference between the multivariate relationship in the output responses at the node and the multivariate relationship observed in the original training data. Our initial choice for creating the regression tree node cost was to use the Mahalanobis distance square [13], which improved our results as compared to the RF approach [14]. The Mahalanobis distance square, being based on the covariance of the output responses, is suitable for scenarios where the relationships between the drug sensitivities are linear but can fail to capture non-linear relationships with low correlation coefficients. With this consideration, this article explores the design of multivariate regression trees that can capture all types of relationships in the output responses.

To capture the multivariate structure present in the output responses, we consider the use of copulas as they can deconstruct a multivariate distribution into its marginal distributions and underlying relationships that are represented by copula functions. We expect that the multivariate distribution of the sensitivities to a drug set will change based on the type of cell lines they are being applied to but the relationship structure separated from the marginal sensitivity distributions will remain similar. As an example, consider two drugs, Gefitinib and Lapatinib, that might have higher sensitivities when applied to breast cancer cell lines but lower sensitivities when applied to brain tumor cell lines. Thus, the multivariate distribution representing the sensitivities to the two drugs will appear to be skewed towards higher values for breast cancer cell lines and skewed towards lower values for brain tumor cell lines. However, we might observe similar correlation coefficients between the sensitivities for the breast cancer cell lines and the brain tumor cell lines as the primary target of both the drugs (EGFR) maintains the relationship. The correlation coefficient is one of the measures of the multivariate structure that will mostly capture linear relationships. However, incorporating the ability to separate the marginal distributions from the multivariate distribution will provide us with a more detailed representation of the underlying associations.

In this article, we discuss the appropriateness of copulas for capturing the multivariate structure in output responses and subsequently propose a cost function utilizing copulas for evaluating multivariate regression tree node splits. The cost function is a weighted combination of (a) the sum of squares of the differences between the node and mean responses and (b) the difference in the empirical copula observed at the node and the copula representing the training samples. We also demonstrate the suitability of the framework in variable selection where it provides higher importance to biologically relevant features as compared to competing approaches.

Note that the generation of the node cost function based on copulas presented in this paper can be considered as a generalization of the Multivariate Random Forest framework based on the square of Mahalanobis distances [13]. The presented approach can be applied to any predictive modeling scenario with multiple interrelated output responses.

The paper is organized as follows: The Methods section provides a description of the Random Forest framework with proposed extensions to copula based Multivariate Random Forests including design of the node cost function and an illustrative example. The Results section contains the performance of the proposed approach when applied to the Genomics of Drug Sensitivity in Cancer database. The Conclusions section presents the conclusions of the current study and discusses future directions.

Methods

We first present a description of Random Forest regression followed by extension to Multivariate Random Forest regression utilizing the covariance structure of the data. Subsequently, the concept of Copulas is introduced along with their application in designing node splits for multivariate regression trees.

Random Forest Regression

Random Forest (RF) regression refers to ensembles of regression trees [15] where a set of $T$ un-pruned regression trees are generated based on bootstrap sampling from the original training data. For each node, the optimal node splitting feature is selected from a set of $m$ features that are picked randomly from the total $M$ features. For $m < M$, the selection of the node splitting feature from a random set of features decreases the correlation between different trees and thus, the average response of multiple regression trees is expected to have lower variance than individual regression trees. Larger $m$ can improve the predictive capability of individual trees but can also increase the correlation between trees and void any gains from averaging multiple predictions. The bootstrap resampling of the data for training each tree also increases the variation between the trees.


Process of splitting a node

Let $x_{tr}(i, j)$ and $y(i)$ ($i = 1, \ldots, n$; $j = 1, \ldots, M$) denote the training predictor features and output response samples respectively. At any node $\eta_P$, we aim to select a feature $j_s$ from a random set of $m$ features and a threshold $z$ to partition the node into two child nodes $\eta_L$ (left node with samples satisfying $x_{tr}(i \in \eta_P, j_s) \le z$) and $\eta_R$ (right node with samples satisfying $x_{tr}(i \in \eta_P, j_s) > z$).

We consider the node cost as the sum of squared differences:

$$D(\eta_P) = \sum_{i \in \eta_P} \left( y(i) - \mu(\eta_P) \right)^2 \tag{1}$$

where $\mu(\eta_P)$ is the expected value of $y(i)$ in node $\eta_P$. Thus the reduction in cost for partition $\gamma$ at node $\eta_P$ is

$$C(\gamma, \eta_P) = D(\eta_P) - D(\eta_L) - D(\eta_R) \tag{2}$$

The partition $\gamma$ that maximizes $C(\gamma, \eta_P)$ over all possible partitions is selected for node $\eta_P$.

Note that for a continuous feature with $n$ samples, a total of $n$ partitions need to be checked. Thus, the computational complexity of each node split is $O(mn)$. During the tree generation process, a node with fewer than $n_{size}$ training samples is not partitioned any further.
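The split search of Eqs 1 and 2 can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' Matlab implementation: the function names `node_cost` and `best_split`, the toy data, and the choice of candidate thresholds (every observed feature value except the largest, so that both child nodes are non-empty) are assumptions made for the example.

```python
import numpy as np

def node_cost(y):
    """Eq 1: sum of squared deviations from the node mean (0 for an empty node)."""
    return float(np.sum((y - y.mean()) ** 2)) if y.size else 0.0

def best_split(x_tr, y, feature_ids):
    """Return (cost reduction, feature, threshold) maximizing Eq 2 over the given features."""
    parent = node_cost(y)
    best = (-np.inf, None, None)
    for j in feature_ids:
        # Candidate thresholds: observed values of feature j, largest one excluded.
        for z in np.unique(x_tr[:, j])[:-1]:
            left = x_tr[:, j] <= z
            gain = parent - node_cost(y[left]) - node_cost(y[~left])
            if gain > best[0]:
                best = (gain, j, float(z))
    return best

rng = np.random.default_rng(0)                 # arbitrary seed
x_tr = rng.normal(size=(40, 5))                # toy data: 40 samples, M = 5 features
y = np.where(x_tr[:, 2] > 0.0, 3.0, -3.0)      # response driven by feature index 2 only
gain, j_s, z = best_split(x_tr, y, feature_ids=[0, 2, 4])   # a subset of m = 3 features
```

In a forest, `feature_ids` would be redrawn at random for every node; here it is fixed so that the informative feature is among the candidates.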

Forest Prediction

Using the randomized feature selection process, we fit the tree based on the bootstrap sample $\{(X_1, Y_1), \ldots, (X_n, Y_n)\}$ generated from the training data.

Let us consider the prediction based on a test sample $x$ for the tree $\Theta$. Let $\eta(x, \Theta)$ be the partition containing $x$; the tree response takes the form [15–17]:

$$y(x, \Theta) = \sum_{i=1}^{n} w_i(x, \Theta)\, y(i) \tag{3}$$

where the weights $w_i(x, \Theta)$ are given by

$$w_i(x, \Theta) = \frac{1\{x_{tr}(i) \in \eta(x, \Theta)\}}{\#\{r : x_{tr}(i) \in \eta(x_{tr}(r), \Theta)\}} \tag{4}$$

Let the $T$ trees of the Random Forest be denoted by $\Theta_1, \ldots, \Theta_T$ and let $w_i(x)$ denote the average weights over the forest, i.e.

$$w_i(x) = \frac{1}{T} \sum_{j=1}^{T} w_i(x, \Theta_j) \tag{5}$$

The Random Forest prediction for the test sample $x$ is then given by

$$y(x) = \sum_{i=1}^{n} w_i(x)\, y(i) \tag{6}$$
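The weighted form of the prediction in Eqs 3 to 6 is equivalent to averaging the leaf means of the individual trees. The following Python sketch illustrates this on hand-made leaf assignments; the function name `forest_weights` and the toy leaf indices are hypothetical, and the sketch assumes that the leaf reached by the test sample contains at least one training sample in every tree.

```python
import numpy as np

def forest_weights(train_leaves, test_leaves):
    """Eqs 4-5: average over trees of the normalized same-leaf indicator.

    train_leaves: (T, n) leaf index of each training sample in each tree
    test_leaves:  (T,)   leaf index reached by the test sample in each tree
    """
    same_leaf = train_leaves == test_leaves[:, None]              # indicator of Eq 4
    per_tree = same_leaf / same_leaf.sum(axis=1, keepdims=True)   # weights of one tree
    return per_tree.mean(axis=0)                                  # Eq 5

# Hypothetical leaf assignments for T = 2 trees and n = 4 training samples.
train_leaves = np.array([[0, 0, 1, 1],
                         [0, 1, 1, 1]])
test_leaves = np.array([0, 1])
y = np.array([1.0, 3.0, 5.0, 7.0])

w = forest_weights(train_leaves, test_leaves)
y_hat = float(w @ y)                                              # Eq 6
```

In this toy case the first tree predicts the mean of the first two responses (2.0) and the second tree the mean of the last three (5.0), so the forest prediction is their average.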

Multivariate Random Forest (MRF)

Let us now consider the multiple response scenario with output $y(i, k)$ ($i = 1, \ldots, n$; $k = 1, \ldots, r$). The primary difference between MRF and RF is in the generation of the trees, which use the different node costs $D_m(\eta)$ and $D(\eta)$ respectively [13].

The node cost $D(\eta_P) = \sum_{i \in \eta_P} (y(i) - \mu(\eta_P))^2$ for the univariate case is the sum of squares of the differences between the output response and the mean output response for the node. For the multivariate case, we would like to use a multivariate node cost that calculates the difference between a sample point and the multivariate mean distribution. One possible measure is the sum of the squares of Mahalanobis distances [18] as shown next:

$$D_m(\eta_P) = \sum_{i \in \eta_P} \left( \mathbf{y}(i) - \boldsymbol{\mu}(\eta_P) \right) \Lambda^{-1} \left( \mathbf{y}(i) - \boldsymbol{\mu}(\eta_P) \right)^T \tag{7}$$

where $\Lambda$ is the covariance matrix, $\mathbf{y}(i)$ is the row vector $(y(i, 1), \ldots, y(i, r))$ and $\boldsymbol{\mu}(\eta_P)$ is the row vector denoting the mean of $\mathbf{y}(i)$ in node $\eta_P$. The inverse covariance matrix ($\Lambda^{-1}$) is a precision matrix [19], which is helpful for testing conditional dependence between multiple random variables. The Mahalanobis distance square normalizes the output responses by their standard deviations and, when $\Lambda$ is diagonal, it represents the normalized Euclidean distance. For the bivariate case with covariance $\Lambda$, the node cost is increased when the deviations of the two output responses from the mean responses are in opposite directions.

Since the Mahalanobis distance captures the distance of the sample point from the mean of the node along the principal component axes, it might be unable to capture nonlinear relationships that produce a close to diagonal covariance matrix. Thus, our objective is to introduce copulas to capture the nonlinear multivariate structure.
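As a small numerical illustration of Eq 7, the Python sketch below evaluates the Mahalanobis node cost for a toy node with two output responses. The function name `mahalanobis_node_cost` is hypothetical, and passing the covariance matrix as an argument is an assumption of the sketch, since the choice of samples used to estimate $\Lambda$ is left to the caller here.

```python
import numpy as np

def mahalanobis_node_cost(Y, cov):
    """Eq 7: sum over node samples of (y - mu) cov^{-1} (y - mu)^T."""
    dev = Y - Y.mean(axis=0)               # deviations from the node mean, shape (n, r)
    precision = np.linalg.inv(cov)         # inverse covariance (precision matrix)
    return float(np.einsum("ij,jk,ik->", dev, precision, dev))

# Toy node with 4 samples and r = 2 output responses.
Y = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 6.0], [4.0, 3.0]])
cost_identity = mahalanobis_node_cost(Y, np.eye(2))             # plain sum of squares
cost_scaled = mahalanobis_node_cost(Y, np.diag([1.0, 4.0]))     # second response down-weighted
```

With an identity matrix the cost reduces to the sum of the univariate costs of Eq 1 over the two responses (5 + 14 = 19), and with a diagonal matrix each response is divided by its variance (5 + 14/4 = 8.5), which is the normalized Euclidean distance mentioned above.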

Copula Description

Copulas can represent the dependence between multiple random variables independent of the marginal distributions. A copula function [20] expresses the joint cumulative probability distribution in terms of the marginal cumulative probability distributions. Let $C_1, C_2, \ldots, C_N$ represent $N$ real valued random variables uniformly distributed on $[0, 1]$. The copula $C: [0, 1]^N \to [0, 1]$ with parameter $\theta$ is defined as:

$$C_\theta(u_1, u_2, \ldots, u_N) = P(C_1 \le u_1, C_2 \le u_2, \ldots, C_N \le u_N) \tag{8}$$

The multivariate cumulative probability distribution $F_X(x_1, x_2, \ldots, x_N)$ and the marginal cumulative probability distributions $F_i(x_i)$ for $i \in \{1, 2, \ldots, N\}$ are related by Sklar's theorem [20] as follows:

$$F_X(x_1, x_2, \ldots, x_N) = C(F_1(x_1), F_2(x_2), \ldots, F_N(x_N)) \tag{9}$$

If the marginal cumulative distributions $F_i(x)$ are continuous, the copula $C$ is unique [20].

Some copulas can be parameterized using few parameters. For instance, the Clayton copula [21] for a bivariate distribution is defined as follows using parameter $\xi$:

$$C(u_1, u_2; \xi) = \left( u_1^{-\xi} + u_2^{-\xi} - 1 \right)^{-1/\xi}, \quad \xi \in (0, \infty) \tag{10}$$

Similarly, the copula characterizing two independent variables will have the form $C(u_1, u_2) = u_1 u_2$. Some other common forms of parameterized copulas include the Gaussian copula [22], Frank copula [23], Student's t-copula [24] and Gumbel copula [25]. However, the standard forms of parameterized copulas may not capture all forms of relationships. We can consider the use of empirical copulas that are estimated directly from the cumulative multivariate distribution. Note that the calculation of empirical copulas will have higher computational complexity than parameterized copulas but they can capture a broad range of relationships. We utilize empirical copulas to represent our multivariate structures.
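To make the idea of an empirical copula concrete, the following Python sketch estimates a bivariate empirical copula from ranks and evaluates the Clayton copula of Eq 10. It is a sketch under stated assumptions rather than the estimator used in the paper: the rank-based pseudo-observations, the evaluation grid, and the function names `pseudo_observations`, `empirical_copula` and `clayton_copula` are all illustrative choices.

```python
import numpy as np

def pseudo_observations(Y):
    """Map each column to (0, 1] through its ranks, which removes the marginal distributions."""
    n = Y.shape[0]
    ranks = np.argsort(np.argsort(Y, axis=0), axis=0) + 1   # ranks 1..n per column
    return ranks / n

def empirical_copula(Y, grid):
    """Bivariate empirical copula on grid x grid: fraction of samples with U <= u and V <= v."""
    U = pseudo_observations(Y)
    below_u = (U[:, 0][:, None] <= grid[None, :]).astype(float)   # shape (n, g)
    below_v = (U[:, 1][:, None] <= grid[None, :]).astype(float)   # shape (n, g)
    return below_u.T @ below_v / Y.shape[0]

def clayton_copula(u, v, xi):
    """Eq 10 for xi > 0."""
    return (u ** -xi + v ** -xi - 1.0) ** (-1.0 / xi)

grid = np.arange(1, 11) / 10.0
y1 = np.arange(1.0, 51.0)
# Second response is a monotone but non-linear function of the first.
C_emp = empirical_copula(np.column_stack([y1, np.exp(y1 / 10.0)]), grid)
```

Because the second toy response is an increasing function of the first, the ranks coincide and the estimate matches the upper bound $\min(u, v)$ on the grid regardless of the non-linear marginal transformation, which illustrates the separation of dependence structure from marginals.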


Node Split Criteria using Copula

As described earlier, the regression tree generation process involves partitioning a node into two branches by optimizing a cost criterion. The node cost for univariate regression trees is given by Eq 1 and the node cost for multivariate regression trees utilizing the Mahalanobis distance is shown in Eq 7. The feature and threshold that result in the maximum cost reduction for that node are selected for splitting.

We next discuss the design of a node cost function based on copulas to capture the output dependencies. We expect that the dependency structure among the samples in a node should be similar to the dependency structure observed in the original training data. Consider the node $\eta_P$ with $N_P$ samples and let $C$ denote the integral of the difference in the empirical copulas observed at node $\eta_P$ and the root node (this is the same as the empirical copula for the training data). We design the node cost $D_C(\eta_P)$ for a copula based multivariate regression tree as follows:

$$D_C(\eta_P) = D_1 + \alpha D_2 \tag{11}$$

where

$$D_1 = 6 N_P C \quad \text{and} \quad D_2 = \sum_{j=1}^{r} \frac{\sum_{i \in \eta_{P,j}} \left[ y(i, j) - \mu(\eta_{P,j}) \right]^2}{\sigma_j^2} \tag{12}$$

where $\alpha$ denotes a scaling factor determining the relative weight of the two components of the node cost, $\eta_{P,j}$ for $j \in \{1, \ldots, r\}$ denotes the set of $j$th output responses at node $\eta_P$ and $\sigma_j^2$ for $j \in \{1, \ldots, r\}$ denotes the variance of the $j$th output response at the root node.

We next present the motivation for selecting the weight 6 for the integral of copula distance along with approaches to select the scaling factor $\alpha$. For maintaining $D_1$ and $D_2$ in the same range, we analyzed the range of $C$ as compared to $D_2$.

Hereafter, the MRF approach that uses the copula based node splitting criterion (based on Eq 11) will be termed CMRF and the MRF approach using the covariance based node splitting criterion (based on Eq 7) will be termed VMRF.
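The node cost of Eqs 11 and 12 can be sketched for two output responses as follows. This is a minimal illustration, not the authors' Matlab code. Several choices are assumptions of the sketch and are marked in the comments: the copula difference is taken in absolute value and its integral is approximated by a grid average, the empirical copula is rank based, the root variance uses the population formula, and the names `empirical_copula` and `copula_node_cost` are hypothetical.

```python
import numpy as np

def empirical_copula(Y, grid):
    """Bivariate empirical copula of the two columns of Y evaluated on grid x grid."""
    n = Y.shape[0]
    U = (np.argsort(np.argsort(Y, axis=0), axis=0) + 1) / n     # rank-based pseudo-observations
    below_u = (U[:, 0][:, None] <= grid[None, :]).astype(float)
    below_v = (U[:, 1][:, None] <= grid[None, :]).astype(float)
    return below_u.T @ below_v / n

def copula_node_cost(Y_node, Y_root, alpha, grid):
    """Eq 11: D_C = D_1 + alpha * D_2 for a node with two output responses."""
    # Assumption: integral of the absolute copula difference, approximated by a grid average.
    C = float(np.mean(np.abs(empirical_copula(Y_node, grid) - empirical_copula(Y_root, grid))))
    D1 = 6.0 * Y_node.shape[0] * C
    dev = Y_node - Y_node.mean(axis=0)
    # Assumption: root-node variance computed with the population formula (ddof = 0).
    D2 = float(np.sum(np.sum(dev ** 2, axis=0) / Y_root.var(axis=0)))
    return D1 + alpha * D2

rng = np.random.default_rng(1)                        # arbitrary seed
y1 = rng.normal(size=50)
Y_root = np.column_stack([y1, (y1 - y1.mean()) ** 2]) # two responses with a quadratic link
grid = np.arange(1, 21) / 20.0

root_cost = copula_node_cost(Y_root, Y_root, alpha=1.0, grid=grid)       # D_1 = 0 at the root
child_d1 = copula_node_cost(Y_root[:20], Y_root, alpha=0.0, grid=grid)   # D_1 only, N_P = 20
```

At the root the copula term vanishes and $D_2$ equals $r N_P$ (here 2 × 50), while for a child node the $D_1$ term stays below $N_P$, in line with the scaling argument that motivates the factor 6.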

Analyzing integral of differences in copulas

We first analyze the upper bound on the integral of the difference between two bivariate copulas and subsequently explore further multivariate copulas. Based on the Fréchet-Hoeffding bounds [26], any bivariate copula $C(u, v)$ is bounded by the following:

$$C_L(u, v) = \max[u + v - 1, 0] \le C(u, v) \le \min[u, v] = C_U(u, v)$$

Thus for any two copulas $C_1(u, v)$ and $C_2(u, v)$, we have

$$|C_1(u, v) - C_2(u, v)| \le C_U(u, v) - C_L(u, v) \quad \forall\, u, v \in [0, 1]$$

Consequently,

$$\int_{v=0}^{1} \int_{u=0}^{1} |C_1(u, v) - C_2(u, v)| \, du \, dv \le \int_{v=0}^{1} \int_{u=0}^{1} [C_U(u, v) - C_L(u, v)] \, du \, dv$$

Using the two diagonals in the unit square ($u = v$ and $u + v = 1$), we can divide the unit square into four triangles where the values of $C_U(u, v)$ and $C_L(u, v)$ are simple functions of $u$ and $v$. For region 1, we have $u > v$ and $u + v > 1$ and

$$C_U(u, v) - C_L(u, v) = v - (u + v - 1) = 1 - u$$

For region 2, we have $u > v$ and $u + v \le 1$ and

$$C_U(u, v) - C_L(u, v) = v - 0 = v$$

Similarly for region 3, we have $u \le v$ and $u + v \le 1$ and

$$C_U(u, v) - C_L(u, v) = u - 0 = u$$

And for region 4, we have $u \le v$ and $u + v > 1$ and

$$C_U(u, v) - C_L(u, v) = u - (u + v - 1) = 1 - v$$

The integral over region 1 is as follows:

$$\iint_{\text{Region 1}} (C_U(u, v) - C_L(u, v)) \, du \, dv = \int_{v=0.5}^{1} \int_{u=v}^{1} (1 - u) \, du \, dv + \int_{v=0}^{0.5} \int_{u=1-v}^{1} (1 - u) \, du \, dv = \frac{1}{24}$$

We can likewise show that the value of the integral for each of the three other regions is $1/24$. Thus

$$\int_{v=0}^{1} \int_{u=0}^{1} (C_U(u, v) - C_L(u, v)) \, du \, dv = \frac{1}{6}$$

Thus, the upper bound on the surface integral of the difference between any two bivariate copulas is $1/6$. Similarly, if we consider the independent copula $C_I(u, v) = uv$, then we can show that

$$\int_{v=0}^{1} \int_{u=0}^{1} (C_U(u, v) - C_I(u, v)) \, du \, dv = \frac{1}{12}$$
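The bivariate values derived above can be checked numerically. The Python sketch below is a simple Monte Carlo estimate, offered as an illustration rather than as the simulation procedure used in the paper; the function name `frechet_bound_integrals`, the sample size and the seed are assumptions. It uses the standard Fréchet-Hoeffding bounds in any dimension, namely $\min(u_1, \ldots, u_d)$ and $\max(u_1 + \cdots + u_d - d + 1, 0)$.

```python
import numpy as np

def frechet_bound_integrals(dim, n_samples=200_000, seed=0):
    """Monte Carlo estimates of the integrals of C_U - C_L, C_U - C_I and C_U over [0, 1]^dim."""
    rng = np.random.default_rng(seed)
    u = rng.random((n_samples, dim))                      # uniform points in the unit cube
    upper = u.min(axis=1)                                 # C_U = min(u_1, ..., u_dim)
    lower = np.maximum(u.sum(axis=1) - dim + 1.0, 0.0)    # C_L = max(sum u_i - dim + 1, 0)
    indep = u.prod(axis=1)                                # C_I = product of the u_i
    return (float((upper - lower).mean()),
            float((upper - indep).mean()),
            float(upper.mean()))

gap_lower, gap_indep, upper_only = frechet_bound_integrals(dim=2)
```

For `dim=2` the three estimates should lie close to 1/6, 1/12 and 1/3; calling the function with `dim=3` or `dim=4` gives estimates that can be set against the simulated values reported in Table 1.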

For $n > 2$, we conducted simulations to estimate the values of the integrals, which are shown in Table 1.

Table 1. Integral of Copula Differences for different dimensions.

| Dimensions | $\int (C_U - C_L)$ | $\int (C_U - C_I)$ | $\int C_U$ |
|---|---|---|---|
| 2 | 1/6 | 1/12 | 1/3 |
| 3 | 0.2093 | 0.1255 | 0.252 |
| 4 | 0.197 | 0.1425 | 0.2072 |

Thus, since the upper bound of $\int (C_U - C_L)$ lies in the range of 1/6 to 0.21, $D_1 = 6 N_P C$ will be upper bounded by $N_P$ for a bivariate copula. If the regression tree is unable to reduce the initial variance in each output response, the value of $D_2$ will be in the range of $r N_P$ as the $j$th numerator term will be close to $N_P \sigma_j^2$. However, since the regression tree will likely reduce the variance in the output response at nodes further away from the root, the value of $D_2$ will be much lower than $r N_P$.

Selection of α

Our previous analysis of $C$ provided a range of the integral difference between two copulas but was unable to provide a weight factor for combining $D_1$ and $D_2$ that is optimal in terms of predictive performance. We expect that the behavior of $D_1$ and $D_2$ will change significantly for different training datasets and thus we select $\alpha$ based on each training dataset. We next describe two techniques to select the weight factor $\alpha$ to achieve higher prediction accuracy.

Method 1: Evaluating and selecting from a set of α's. This is a straightforward approach where different values of $\alpha$ (we considered 10 values of $\alpha$ spaced between 0.1 and 10) are evaluated and the one with the best predictive performance is selected. The original training data is sub-divided into secondary training and secondary testing samples. The secondary training samples are used to create MRF models that are used to predict the secondary testing samples. This process is repeated for a set of possible values of $\alpha$. The correlation coefficients between the predicted and original secondary testing samples are recorded and the $\alpha$ corresponding to the highest correlation coefficient is selected ($\alpha_S$). This $\alpha_S$ is then used to create an MRF model using the original training samples, which is tested on the original testing samples. For our examples, we have applied 10-fold CV on the original data and thus for each fold of training data, we may select a different $\alpha$. However, for each specific fold, $\alpha$ will be fixed for all the trees generated. The above method increases the computational complexity due to the evaluation of multiple values of $\alpha$. We next present another approach that attempts to reduce the evaluation of numerous values of $\alpha$.

Method 2: Pareto Frontier Approach to select α. In this approach, we consider the node cost function minimization from a multi-objective optimization perspective where we aim to jointly minimize both $D_1$ and $D_2$. From the multi-objective perspective, if we plot $D_2$ vs $D_1$ for all possible feature and threshold combinations for a specific node (if we have $n$ samples at a specific node and $m$ features, the number of partitions to be evaluated is $mn$), we should select a feature and threshold combination that lies on the Pareto frontier. In other words, we look for solutions that are not dominated by any other solution: for instance, if the $D_1$ and $D_2$ values for $w$ different feature and threshold combinations are denoted by $\{D_1(i), D_2(i)\}$ for $i \in \{1, \ldots, w\}$, a combination $i$ is considered dominated by $j$ if either (a) $D_1(i) > D_1(j)$ and $D_2(i) \ge D_2(j)$ or (b) $D_1(i) \ge D_1(j)$ and $D_2(i) > D_2(j)$ is valid. The feature and threshold combinations that are not dominated by any of the other $w - 1$ combinations form the Pareto frontier. For instance, Fig 1(a) shows an example Pareto frontier (red circles) for the left child node for the first split of a specific tree (the $D_1$ and $D_2$ values are denoted by $D_{1L}$ and $D_{2L}$ respectively) generated from a synthetic example described in the next section. Similarly, Fig 2(a) shows the Pareto frontier (red circles) for the right child node for the first split of a specific tree (the $D_1$ and $D_2$ values are denoted by $D_{1R}$ and $D_{2R}$ respectively).

Our idea is to approximate the Pareto frontier using straight lines and utilize the slope of the lines to design $\alpha$. Figs 1(b) and 2(b) show that the Pareto frontier can be approximated by two straight lines: one with slope greater than 1 and another with slope less than 1. Consequently, the value of $\alpha$ can be approximated by the following equation:

$$\alpha = 1/\rho$$

where $\rho$ denotes the slope of the straight line fitted to the Pareto frontier. Thus, we have 2 possible values of $\alpha$ from the very first split of a specific tree. If we prepare scatter-plots for the first split of all the trees for $\alpha > 1$ and $\alpha < 1$, we arrive at plots similar to Fig 3(a) and 3(b) respectively. In this method, we calculate the predictive performance of the two average $\alpha$'s and select the one with the best performance to design the overall forest.
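The dominance rule used to build the Pareto frontier can be written directly from conditions (a) and (b). The Python sketch below is illustrative only: the function name `pareto_frontier` and the five $(D_1, D_2)$ pairs are made up for the example, and the subsequent line fitting step is not shown.

```python
import numpy as np

def pareto_frontier(d1, d2):
    """Indices of (d1, d2) pairs that are not dominated by any other pair (both costs minimized)."""
    keep = []
    for i in range(len(d1)):
        # Combination i is dominated if some j satisfies condition (a) or condition (b).
        dominated = np.any(
            ((d1[i] > d1) & (d2[i] >= d2)) | ((d1[i] >= d1) & (d2[i] > d2))
        )
        if not dominated:
            keep.append(i)
    return keep

# Hypothetical D_1 and D_2 values for w = 5 feature and threshold combinations.
d1 = np.array([1.0, 2.0, 3.0, 4.0, 2.5])
d2 = np.array([9.0, 4.0, 5.0, 1.0, 6.0])
front = pareto_frontier(d1, d2)
```

Here the combinations with indices 2 and 4 are dominated by index 1, which has both a smaller $D_1$ and a smaller $D_2$, so only indices 0, 1 and 3 remain on the frontier. The pairwise check costs $O(w^2)$ comparisons, which is acceptable for a sketch but could be replaced by a sort-based scan for large $w$.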

An example to illustrate the appropriateness of Copula based node cost for design of Multivariate Regression Trees

We have observed that multivariate random forests incorporating covariance (Mahalanobis distance square) between output responses are more suitable for predicting output responses with linear relationships as compared to output responses with non-linear relationships. Copulas present a methodology to capture the non-linear dependence relationships between multiple variables, and we anticipate that copula will be suitable for predicting output drug

with non-linear relationship to investigate the performance of the proposed approach as compared to covariance based MRF design.

We consider a 50 × 10 input data matrix (50 samples and 10 features) denoted by $X$ that was created randomly from a normal distribution $\mathcal{N}(0, 1)$. We next generated two output responses $Y_1$ and $Y_2$ based on functions of the input features. Let column vectors $x_i$ ($i = 1, 2, \ldots, 10$) denote the 10 features; the output responses $Y_1$ and $Y_2$ are defined as follows:

$$Y_1 = 2x_1 + 5x_2 - 1.5x_3 + x_4 \tag{13}$$

$$Y_2 = (Y_1 - E(Y_1))^2 \tag{14}$$

Note that the output responses are dependent on only 4 features out of the 10 possible input features. Based on the relative weights, $x_2$ is the most weighted feature and should play a critical role while growing the trees at the beginning. Note that $Y_1$ and $Y_2$ have a quadratic relationship.
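The synthetic data of Eqs 13 and 14 can be generated with a few lines of Python. This is a sketch under stated assumptions: the seed is arbitrary, the sample mean of $Y_1$ stands in for $E(Y_1)$, and the Pearson correlation is computed only to let the reader inspect how much of the quadratic relationship a linear measure picks up.

```python
import numpy as np

rng = np.random.default_rng(0)                 # arbitrary seed, not taken from the paper
X = rng.normal(0.0, 1.0, size=(50, 10))        # 50 samples, 10 features drawn from N(0, 1)

x1, x2, x3, x4 = X[:, 0], X[:, 1], X[:, 2], X[:, 3]
Y1 = 2 * x1 + 5 * x2 - 1.5 * x3 + x4           # Eq 13
Y2 = (Y1 - Y1.mean()) ** 2                     # Eq 14, sample mean used in place of E(Y1)

pearson = np.corrcoef(Y1, Y2)[0, 1]            # linear correlation between the two responses
Y = np.column_stack([Y1, Y2])                  # two-column response matrix for a multivariate tree
```

Although $Y_2$ is a deterministic function of $Y_1$, the relationship is not monotone, so a covariance based measure such as `pearson` need not reflect the strength of the dependence.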

We consider two multivariate regression trees trained on the same input $X$ and same output responses $[Y_1, Y_2]$ but different node splitting criteria. The regression trees denoted by Tree{V, [Y1, Y2]} and Tree{C, [Y1, Y2]} are inferred using the covariance (Eq 7) and copula (Eq 11) based node cost functions respectively. For this example, all the features were considered at each node, i.e. $m = M$, and a randomly chosen 80% of the 50 samples with bootstrapping were used for generating the regression trees.

Fig 1. $D_2$ vs $D_1$ for left node. (a) shows an example Pareto frontier (red circles) for the left child node for the first split of a specific tree (the $D_1$ and $D_2$ values are denoted by $D_{1L}$ and $D_{2L}$ respectively). (b) shows that the Pareto frontier can be approximated by two straight lines: one with slope greater than 1 and another with slope less than 1.

Fig 4 shows two multivariate regression trees generated using copula (Eq 11) and covariance (Eq 7) based node cost functions. Fig 4 illustrates that the splitting process for each tree is dependent on different features at each node, which eventually leads to two totally dissimilar trees. The empty circles denote leaf nodes; the circles enclosing a number signify a split node and the number inside the circle indicates the feature selected at that node for splitting.

We expect that the regression tree generation based on copula as compared to covariance will be able to better capture the non-linear relationship between $Y_1$ and $Y_2$. Fig 4 demonstrates that the copula based Tree{C, [Y1, Y2]} has selected the most significant features (features 2 and 1, which have the highest weights during generation of $Y_1$ and $Y_2$) while generating the multivariate regression tree. On the other hand, the covariance based tree Tree{V, [Y1, Y2]} trained on the same data selected a spurious feature 7, which was not involved in the generation of either $Y_1$ or $Y_2$.

To visually compare the multivariate structure during regression tree splits, we plotted the cumulative distribution functions (CDFs) for the original data and after splitting using copula and covariance based node cost functions. Fig 5 shows the original CDF and the CDFs at the left and right child nodes when the node split is based on Eq 11 (CMRF). Likewise, Fig 6 shows the original CDF and the CDFs at the left and right child nodes when the node split is based on Eq 7 (VMRF). We observe that the node split using the copula based node cost better maintains the CDF observed in the original data (Fig 5(b) and 5(c) are similar to (a)) as compared to the split using the covariance based node cost (Fig 6(b) is significantly different from (a)).

Fig 2. $D_2$ vs $D_1$ for right node. (a) shows an example Pareto frontier (red circles) for the right child node for the first split of a specific tree (the $D_1$ and $D_2$ values are denoted by $D_{1R}$ and $D_{2R}$ respectively). (b) shows that the Pareto frontier can be approximated by two straight lines: one with slope greater than 1 and another with slope less than 1.

Fig 3. Scatter plot of α's across the trees. (a) and (b) are scatter-plots for the first split of all the trees for $\alpha > 1$ and $\alpha < 1$ respectively.

Fig 4. Two multivariate regression trees trained on the same input X and same output responses [Y1, Y2] but the node cost criteria being copula based (Tree{C, [Y1, Y2]}) and covariance based (Tree{V, [Y1, Y2]}) respectively. The empty circles represent leaf nodes and the circles enclosing a number signify a split node; the number inside the circle indicates the feature selected at that node for splitting.

Variable Importance Measure (VIM). In this section, we consider the issue of feature selection for MRFs. We would like to generate and compare the Variable Importance Measure (VIM) for CMRF and VMRF. We expect that CMRF will have higher feature scores for the sig-nificant features as compared to VMRF. Typical variable importance measure for random forest considers the frequency of feature selection, out of bag error or permutation measures [27]. We consider the basic approach of calculating the number of times each feature gets selected and the VIM for each forest will be the sum of these frequencies across all trees normalized to the range between 0 and 1. Based on the synthetic data, we generated 100 Multivariate Regression Trees using CMRF (with fixedα) and VMRF with output responsesY1andY2and generated the variable importance of the 10 input features. The normalized variable importance scores reported inTable 2illustrate that the top four features selected by CMRF (X2,X1,X3,X4) are the same as the four features that were used to generateY1andY2using Eqs13and14respectively. Furthermore, the ordering of the scoresVIM(X2)>VIM(X1)>VIM(X3)>VIM(X4) is same as

Fig 5. CDF created from left and right child node for a single split using CMRF.It is compared visually with the original CDF created from the training samples.


the ordering of the absolute weights of the four features in the generation of Y1, where X2 has the largest weight followed by X1, X3 and X4. On the other hand, the top four features selected by VMRF (X2, X1, X3, X6) fail to include X4 and include a spurious feature X6 that was not involved in the generation of output response Y1. Thus, the example suggests that copula based MRF may be better suited to selecting the top features than covariance based MRF.
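The selection-frequency importance described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the representation of a tree as a list of split feature indices, the function name `selection_frequency_vim`, and the choice to normalize by the total number of splits (so that scores sum to 1, as the Table 2 scores approximately do) are assumptions.

```python
from collections import Counter


def selection_frequency_vim(trees, n_features):
    """Normalized selection-frequency importance.

    trees: list of trees, each given as a list of the feature indices
           chosen at its split nodes (a hypothetical representation).
    Returns a list of length n_features whose entries sum to 1
    (assumed normalization: divide by the total number of splits).
    """
    counts = Counter()
    for split_features in trees:
        counts.update(split_features)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * n_features
    return [counts[j] / total for j in range(n_features)]


# Toy forest with three trees over four features (indices 0..3).
toy_forest = [[1, 0, 1], [1, 2], [0, 1, 3]]
vim = selection_frequency_vim(toy_forest, n_features=4)
print(vim)
```

In the toy forest, feature 1 is selected in four of the eight splits, so it receives the largest score.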

Fig 6. CDF created from left and right child node for a single split using VMRF. It is compared visually with the original CDF created from the training samples.

doi:10.1371/journal.pone.0144490.g006

Table 2. Variable importance measure calculated using CMRF_{Y1,Y2} and VMRF_{Y1,Y2}.

Feature        X1      X2      X3      X4      X5      X6      X7      X8      X9      X10
CMRF_{Y1,Y2}   0.2440  0.3720  0.0952  0.0744  0.0179  0.0625  0.0238  0.0417  0.0327  0.0357
VMRF_{Y1,Y2}   0.2340  0.3700  0.0836  0.0529  0.0056  0.0729  0.0418  0.0418  0.0474  0.0501


Results

For analyzing the prediction capabilities of our framework, we considered two different datasets: GDSC and CCLE. Both include genomic characterization of numerous cell lines and different drug responses for each cell line. For the current analysis, we consider the gene expression data as the genomic characterization information for both datasets. Area under the curve (AUC) is used as the representation of drug responses for both GDSC and CCLE. Both datasets are high dimensional in the number of features (gene expressions). For all performance comparison results presented in this article, a prior feature selection method (RELIEFF [28]) is applied to reduce the number of features to be used for training. A performance comparison of random forest approaches with and without the application of prior feature selection is shown for the GDSC dataset in Table A in S1 File.

For performance comparison purposes, we report results of Copula based MRF (CMRF) along with univariate RF (denoted by RF), Covariance based MRF (VMRF) and Kernelized Bayesian Multitask Learning (KBMTL) [11] approaches. KBMTL is a Bayesian formulation that combines kernel based non-linear dimensionality reduction and regression in a multitask learning framework that tries to solve distinct but related tasks jointly to improve overall generalization performance. We have implemented KBMTL using the algorithmic code provided in [11]. Based on the parameters used in [11], we have considered 200 iterations and gamma prior values (both α and β) of 1. The subspace dimensionality has been set to 20, and the standard deviations of the hidden representations and weight parameters are set to the defaults of 0.1 and 1 respectively.

Results on GDSC Dataset

The GDSC gene expression and drug sensitivity dataset was downloaded from Cancerrxgene.org [29]. The dataset has 789 cell lines with gene expression data and 714 cell lines with drug response data. We considered the intersection of cell lines that had both drug response and gene expression data.

For our experiments, we consider four sets of drug pairs, where three of them have common primary targets and the remaining pair has no common target. We expect that the drug pairs with common primary targets will have some form of relationship among their sensitivities, so CMRF should perform better than VMRF and both should perform better than the RF approach. On the other hand, the drug pair without any common targets is expected to have minimal relationship among the drug sensitivities, and thus RF is expected to outperform CMRF and VMRF. We also present results on a 3 drug set and a 138 drug set for GDSC as Tables C, D and E in S1 File.

Initially each cell line has 22277 features (probeset) as gene expressions. We have reduced it to 500 for each drug response using RELIEFF [28] and used a union of the 500 features in each of the four sets of drugs.
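The union step described above can be sketched as follows. This is an illustration only: it assumes that a per-drug relevance score for every feature is already available (for example from a ReliefF implementation, which is not reproduced here), and the function names and toy scores are hypothetical.

```python
def top_k_features(scores, k):
    """Indices of the k highest-scoring features (ties broken by index)."""
    order = sorted(range(len(scores)), key=lambda j: (-scores[j], j))
    return order[:k]


def union_of_top_features(scores_per_drug, k):
    """Sorted union of each drug's top-k feature indices."""
    selected = set()
    for scores in scores_per_drug:
        selected.update(top_k_features(scores, k))
    return sorted(selected)


# Toy relevance scores for two drugs over five features (hypothetical values).
scores_drug_a = [0.9, 0.1, 0.5, 0.3, 0.0]
scores_drug_b = [0.2, 0.8, 0.6, 0.1, 0.0]
shared_features = union_of_top_features([scores_drug_a, scores_drug_b], k=2)
print(shared_features)
```

With k features per drug and two drugs, the union holds between k and 2k features, depending on how strongly the two rankings overlap.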

The first selected set SC1 consisting of {Erlotinib, Lapatinib} has common target EGFR [30–32]. The second set SC2 consisting of {AZD-0530, TAE-684} has common target ABL1 [32]. The third set SC3 consisting of {AZD6244, PD-0325901} has common target MEK [32–34]. The fourth set SU consisting of {17-AAG, Erlotinib} has no common target.

As mentioned earlier, each drug has some missing responses across the 714 cell lines. The drug sets SC1, SC2, SC3 and SU have drug responses in both drugs for 316, 349, 645 and 300 cell lines respectively. To report our results, we compared 5 fold cross-validated Pearson correlation coefficients, Mean Absolute Error (MAE) and Normalized Root Mean Square Error (NRMSE) between predicted and experimental responses for RF, VMRF, CMRF and KBMTL.


NRMSE of drug m can be calculated as [11]:

$$\mathrm{NRMSE}_m = \sqrt{\frac{(y_m - \hat{y}_m)^{T}(y_m - \hat{y}_m)}{(y_m - \mathbf{1}E(y_m))^{T}(y_m - \mathbf{1}E(y_m))}} \qquad (15)$$

where $y_m$ and $\hat{y}_m$ denote the vectors of actual and predicted drug sensitivities respectively and $E(y_m)$ denotes the mean of vector $y_m$. For both VMRF and CMRF, we set the minimum size of samples in each leaf to nsize = 5, the number of trees in the forest to T = 150 and the splitting in each node considers m = 10 random features.
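The three reported metrics can be sketched in plain Python as follows. The function names and the toy vectors are illustrative assumptions; the NRMSE function follows Eq 15, and the other two are the standard definitions of the Pearson correlation coefficient and the mean absolute error.

```python
import math


def pearson(y, y_hat):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(y)
    mean_y, mean_p = sum(y) / n, sum(y_hat) / n
    cov = sum((a - mean_y) * (b - mean_p) for a, b in zip(y, y_hat))
    var_y = sum((a - mean_y) ** 2 for a in y)
    var_p = sum((b - mean_p) ** 2 for b in y_hat)
    return cov / math.sqrt(var_y * var_p)


def mae(y, y_hat):
    """Mean absolute error."""
    return sum(abs(a - b) for a, b in zip(y, y_hat)) / len(y)


def nrmse(y, y_hat):
    """Eq 15: residual sum of squares over total sum of squares, square-rooted."""
    mean_y = sum(y) / len(y)
    num = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    den = sum((a - mean_y) ** 2 for a in y)
    return math.sqrt(num / den)


# Toy check: predicting the mean of y for every sample gives NRMSE = 1.
y_true = [0.2, 0.4, 0.6, 0.8]
y_mean = [0.5, 0.5, 0.5, 0.5]
print(nrmse(y_true, y_mean), mae(y_true, y_mean))
```

Because the denominator is the squared deviation of the observed responses from their mean, an NRMSE above 1 indicates predictions that are worse than always predicting the mean observed response.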

The correlation coefficients using 5 fold cross validation error estimation are illustrated for each drug set in Table 3. The corresponding MAE and NRMSE behaviors are illustrated in Table 4.

For CMRF, results with the scaling factor α selected using Method-1 discussed earlier have been used. The robustness analysis of α using synthetic data is conducted using Method-2 and is shown as Tables H and I in S1 File. Table 3 shows that CMRF outperformed VMRF, RF and KBMTL in terms of average correlation coefficients for the related drug pairs SC1, SC2 and SC3, whereas CMRF is outperformed by RF and VMRF for the unrelated drug pair SU. Table 4 shows that CMRF outperforms VMRF, KBMTL and RF in terms of average NRMSE for the related pairs of drugs SC1, SC2 and SC3. For the unrelated pair SU, univariate RF outperforms the

Table 3. 5 fold CV results for GDSC Dataset drug sensitivity prediction for four drug sets in the form of correlation coefficients. VMRF and CMRF represent Multivariate Random Forest using Covariance and Copula respectively. KBMTL represents Kernelized Bayesian multitask learning (parameters considered are 200 iterations, α = β = 1 and subspace dimensionality = 20).

Correlation Coefficients

Drug Set  Common Target  Drug Name    RF      VMRF    CMRF    KBMTL
SC1       EGFR           Erlotinib    0.5156  0.5193  0.5301  0.2500
SC1       EGFR           Lapatinib    0.5544  0.5742  0.5699  0.1132
SC2       ABL1           AZD-0530     0.3553  0.3810  0.3990  0.3181
SC2       ABL1           TAE-684      0.4060  0.4100  0.4338  0.2420
SC3       MEK            AZD6244      0.4625  0.4508  0.4590  0.0950
SC3       MEK            PD-0325901   0.5890  0.6022  0.6016  0.3236
SU        None           17-AAG       0.6304  0.6244  0.6167  0.4375
SU        None           Erlotinib    0.5859  0.5906  0.5708  0.4081

doi:10.1371/journal.pone.0144490.t003

Table 4. 5 fold CV results for GDSC Dataset drug sensitivity prediction for four drug sets in the form of MAE and NRMSE for RF, VMRF, CMRF and KBMTL approaches.

                                      MAE                             NRMSE
Drug Set  Common Target  Drug Name    RF      VMRF    CMRF    KBMTL   RF      VMRF    CMRF    KBMTL
SC1       EGFR           Erlotinib    0.0319  0.0322  0.0314  0.0503  0.8733  0.8749  0.8719  1.3365
SC1       EGFR           Lapatinib    0.0292  0.0294  0.0286  0.0488  0.8516  0.8459  0.8486  1.3538
SC2       ABL1           AZD-0530     0.0446  0.0448  0.0442  0.0613  0.9407  0.9378  0.9291  1.2344
SC2       ABL1           TAE-684      0.0829  0.0829  0.0821  0.1159  0.9285  0.9299  0.9195  1.3698
SC3       MEK            AZD6244      0.0584  0.0590  0.0584  0.1138  0.8949  0.9034  0.8962  1.8016
SC3       MEK            PD-0325901   0.0723  0.0727  0.0717  0.1199  0.8263  0.8230  0.8193  1.4028
SU        None           17-AAG       0.0584  0.0590  0.0584  0.1198  0.7840  0.7894  0.7955  1.1624
SU        None           Erlotinib    0.0723  0.0727  0.0717  0.0410  0.8335  0.8441  0.8505  1.1013

doi:10.1371/journal.pone.0144490.t004


multivariate approaches for both average correlation coefficients and NRMSE. The scatter plots of predicted response vs original response for drug set SC1 using RF, VMRF and CMRF are shown in Fig 7.

Results on CCLE Dataset

The CCLE [35] database includes genomic characterization for 1037 cell lines and drug responses over 24 drugs for over 480 cell lines. For the purpose of predicting responses, 4 sets of drugs were selected. The first set SC1 = {Erlotinib, Lapatinib} has EGFR as a common target [30,31], the second set SC2 = {PF-02341066 (Crizotinib), PHA-665752} has MET as a common target [36,37], the third set SC3 = {ZD6474 (Vandetanib), AZD0530 (Saracatinib)} has EGFR as a common target [38,39] and the fourth set SU = {17-AAG, Erlotinib} has no common target. We also include results on a 4 drug set and the 24 drug set for CCLE as Tables B, F and G in S1 File.

Initially, each cell line had 18,988 features (probesets) as gene expressions. We reduced it to 500 for each drug response using RELIEFF [28] feature selection and considered a union of the 500 features in each of the four sets of drugs. We have used the first 300 cell lines that have gene expression and drug responses for specific pairs of drugs. To report our results, we compared 5 fold cross-validated Pearson correlation coefficients, MAE and NRMSE between predicted and experimental responses for RF, VMRF, CMRF and KBMTL. For both VMRF and CMRF, we set the minimum size of samples in each leaf to nsize = 5, the number of trees in the forest to T = 150 and the splitting in each node considers m = 10 random features.
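The 5 fold cross validation protocol used for both datasets can be sketched as follows. This is a generic illustration under stated assumptions: the helper names, the shuffled fold assignment, the pooling of out-of-fold predictions before scoring, and the mean-predictor stand-in model are all hypothetical, since the text does not specify how folds were formed or how fold results were aggregated.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Shuffle sample indices and deal them into k disjoint folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_val_predict(X, y, fit_predict, k=5, seed=0):
    """Return one out-of-fold prediction per sample.

    fit_predict(X_train, y_train, X_test) is any model wrapper that trains
    on the first two arguments and returns predictions for X_test.
    """
    preds = [None] * len(y)
    for test_idx in k_fold_indices(len(y), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(y)) if i not in held_out]
        fold_preds = fit_predict([X[i] for i in train_idx],
                                 [y[i] for i in train_idx],
                                 [X[i] for i in test_idx])
        for i, p in zip(test_idx, fold_preds):
            preds[i] = p
    return preds


def mean_baseline(X_train, y_train, X_test):
    """Stand-in model: predict the training mean for every test sample."""
    m = sum(y_train) / len(y_train)
    return [m] * len(X_test)


X_toy = [[float(i)] for i in range(10)]
y_toy = [0.1 * i for i in range(10)]
folds = k_fold_indices(len(y_toy), k=5)
oof = cross_val_predict(X_toy, y_toy, mean_baseline, k=5)
print(oof)
```

Any regressor (RF, VMRF, CMRF or KBMTL) could be passed in place of `mean_baseline`; the resulting out-of-fold predictions would then be compared against the experimental responses using the correlation coefficient, MAE and NRMSE.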

The correlation coefficients using 5 fold cross validation error estimation are illustrated for each drug set in Table 5. The corresponding MAE and NRMSE behaviors are illustrated in

Fig 7. Scatter plots of predicted response vs original response for Erlotinib and Lapatinib (GDSC). Here corr-coef stands for the correlation coefficient between predicted response and output response.


Table 6. For CMRF, results with the scaling factor α selected using Method-1 discussed earlier have been used. Tables 5 and 6 show that CMRF performed better than VMRF, KBMTL and RF in terms of correlation coefficients and NRMSE for the related drug pairs SC1, SC2 and SC3. When there is no relationship in the drug pair, as in SU, univariate RF performs better than the multivariate approaches on average. The scatter plots of predicted response vs original response for drug set SC2 using RF, VMRF and CMRF are shown in Fig 8.

Results of Variable Importance Analysis

We have examined the variable importance measure for GDSC data using VMRF and CMRF in terms of protein interaction network enrichment analysis. In this section, we will primarily provide the detailed results for SC1 in GDSC. To avoid any bias due to feature selection in variable importance, we consider the full set of probeset IDs without application of RELIEFF for this analysis.

For both VMRF and CMRF, the 50 top ranked probesets were generated separately. It should be noted that multiple probeset IDs can map to a single Gene Symbol of a protein. This mapping was done using the Genome Medicine Database of Japan (GeMDBJ) ID conversion tool (https://gemdbj.nibio.go.jp/dgdb/ConvertOperation.do). Based on this mapping, we arrived at 58 top ranked proteins for VMRF and 70 top ranked proteins for CMRF. These proteins were

Table 5. 5 fold CV results for CCLE Dataset drug sensitivity prediction for four drug sets in the form of correlation coefficients for RF, VMRF, CMRF and KBMTL.

Correlation Coefficients

Drug Set  Common Target  Drug Name    RF      VMRF    CMRF    KBMTL
SC1       EGFR           Erlotinib    0.3916  0.3980  0.3927  0.3457
SC1       EGFR           Lapatinib    0.4460  0.4468  0.4673  0.2609
SC2       MET            Crizotinib   0.4813  0.4719  0.4882  0.4519
SC2       MET            PHA-665752   0.3547  0.3587  0.3746  0.2250
SC3       EGFR           ZD-6474      0.2355  0.2535  0.2627  0.1304
SC3       EGFR           AZD-0530     0.1990  0.1844  0.1957  0.1973
SU        None           17-AAG       0.3620  0.3337  0.3255  0.4100
SU        None           Erlotinib    0.3818  0.3852  0.3718  0.2828

doi:10.1371/journal.pone.0144490.t005

Table 6. 5 fold CV results for CCLE Dataset drug sensitivity prediction for four drug sets in the form of MAE and NRMSE for RF, VMRF, CMRF and KBMTL.

                                      MAE                             NRMSE
Drug Set  Common Target  Drug Name    RF      VMRF    CMRF    KBMTL   RF      VMRF    CMRF    KBMTL
SC1       EGFR           Erlotinib    0.0522  0.0520  0.0515  0.0612  0.9223  0.9210  0.9218  1.0593
SC1       EGFR           Lapatinib    0.0513  0.0520  0.0509  0.0654  0.8976  0.8977  0.8895  1.1398
SC2       MET            Crizotinib   0.0484  0.0483  0.0477  0.0546  0.8836  0.8921  0.8828  0.9674
SC2       MET            PHA-665752   0.0492  0.0496  0.0489  0.0614  0.9367  0.9367  0.9307  1.1573
SC3       EGFR           ZD-6474      0.0660  0.0659  0.0656  0.0876  0.9721  0.9674  0.9650  1.3037
SC3       EGFR           AZD-0530     0.0728  0.0728  0.0727  0.0866  0.9801  0.9834  0.9810  1.2188
SU        None           17-AAG       0.1003  0.1005  0.1008  0.0997  0.9553  0.9614  0.9644  0.9740
SU        None           Erlotinib    0.0517  0.0519  0.0520  0.0612  0.9258  0.9260  0.9311  1.0957

doi:10.1371/journal.pone.0144490.t006


provided as inputs to the string-db database (http://string-db.org/) for known protein-protein interactions. The protein-protein interaction (PPI) networks for top proteins using CMRF and VMRF are shown in Figs 9 and 10 respectively. The enrichment analysis for each network is shown alongside it. We observe that the network generated using CMRF is more enriched in connectivity than the network generated using VMRF. A total of 18 interactions with a p-value of 0.132 were observed for the VMRF PPI network, whereas a total of 35 interactions with a p-value of 0.00775 were observed for the CMRF network. Moreover, the common target EGFR is picked in the top 50 targets and is well connected to other targets of CMRF, whereas EGFR is not selected even in the top 150 targets of VMRF.

Similarly, in drug set SC2 of GDSC (network not shown), there are 42 interactions with 51 proteins in CMRF and 25 interactions with 54 proteins in VMRF.

Conclusions

In this article, we presented an approach to extend ensemble learning using regression trees to multivariate ensemble learning. We utilized the concept of copulas to represent the relationship between different drug sensitivities and incorporated them in the design of the multivariate regression tree cost function. We designed the node cost function as a combination of (a) the sum of squares of the differences from the mean and (b) a measure of the difference in the multivariate structure at the node compared to the original training data. The difference in the multivariate structure was captured as the integral of the absolute difference in the copulas observed at the node and the original training data. Two approaches were presented based on enumeration

Fig 8. Scatter plots of predicted response vs original response for Crizotinib and PHA-665752 (CCLE). Here corr-coef stands for the correlation coefficient between predicted response and output response.


and Pareto frontier to design the weights of the two parts of the cost function. Utilizing synthetic and biological data, we showed that the proposed copula based approach could increase the prediction accuracy as compared to univariate random forests or multivariate random forests based on covariance based node cost. As compared to RF, the gain in the correlation coefficient between predicted and experimental values was observed in scenarios where there exists a relationship between the drug pair sensitivities. The examples were also able to illustrate that CMRF is better suited for selecting the relevant features as compared to VMRF. The proposed methodology provides a novel technique to design multivariate regression trees for scenarios where there are nonlinear relationships between output responses. The presented research can be extended in multiple directions. One such direction will involve extending the concept of maintaining the multivariate structure in the design of weights of individual trees. Another direction consists of analyzing the detailed bias and variance relationship of the proposed technique and designing confidence intervals for the predictions.
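The two-part node cost summarized above can be illustrated with a small sketch for two output responses. Several choices here are assumptions made for illustration and are not taken from the text: the empirical copula is built from within-sample ranks, the integral of the absolute copula difference is approximated by an average over a regular grid, the squared deviations are summed over both responses, and the two parts are combined as a weighted sum with scaling factor `alpha`. All function names are hypothetical.

```python
def ranks(values):
    """1-based ranks of the values (ties broken by position)."""
    order = sorted(range(len(values)), key=lambda i: (values[i], i))
    r = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r


def empirical_copula(y1, y2, grid):
    """Bivariate empirical copula evaluated at every (u, v) in grid x grid."""
    n = len(y1)
    u = [r / n for r in ranks(y1)]
    v = [r / n for r in ranks(y2)]
    return [[sum(1 for a, b in zip(u, v) if a <= gu and b <= gv) / n
             for gv in grid] for gu in grid]


def copula_distance(node, train, n_grid=10):
    """Grid average of |C_node - C_train|, approximating the integral."""
    grid = [(i + 1) / n_grid for i in range(n_grid)]
    c_node = empirical_copula(node[0], node[1], grid)
    c_train = empirical_copula(train[0], train[1], grid)
    total = sum(abs(a - b)
                for row_n, row_t in zip(c_node, c_train)
                for a, b in zip(row_n, row_t))
    return total / (n_grid * n_grid)


def sum_of_squares(node):
    """Squared deviations from the mean, summed over both responses."""
    total = 0.0
    for y in node:
        m = sum(y) / len(y)
        total += sum((a - m) ** 2 for a in y)
    return total


def node_cost(node, train, alpha):
    """Assumed weighted-sum combination of the two parts of the cost."""
    return sum_of_squares(node) + alpha * copula_distance(node, train)


# Toy data: training responses move together, node responses move oppositely.
train_pair = ([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 4.0])
node_pair = ([1.0, 2.0, 3.0, 4.0], [4.0, 3.0, 2.0, 1.0])
print(copula_distance(train_pair, train_pair), copula_distance(node_pair, train_pair))
```

A node whose dependence structure matches the training data incurs no copula penalty, whereas the toy node with reversed dependence incurs a positive one even though its marginal sums of squares are identical.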

Fig 9. Protein-protein interaction network observed between top regulators found from CMRF in GDSC dataset SC1. Disconnected nodes are hidden.


Supporting Information

S1 File. Supporting Information for Article: A copula based approach for design of multivariate random forests for drug sensitivity prediction. RF, VMRF, CMRF results (5 fold cross validation) with and without prior feature selection (Table A). Results for CCLE Dataset drug sensitivity prediction for a drugset with 4 drugs in the form of correlation coefficients for RF, VMRF, CMRF and KBMTL approaches (Table B). Results for GDSC Dataset drug sensitivity prediction for a drugset with 3 drugs in the form of correlation coefficients for RF, VMRF, CMRF and KBMTL approaches (Table C). Results for GDSC Dataset drug sensitivity prediction for a drugset with 140 drugs in the form of correlation coefficients (only 15 drugs that are common with CCLE are shown in detail while the average represents the average of all 140 drugs) (Table D). Results for GDSC Dataset drug sensitivity prediction for a drugset with 140 drugs in the form of NRMSE (only 15 drugs that are common with CCLE are shown in detail while the average represents the average of all 140 drugs) (Table E). Results for CCLE Dataset

Fig 10. Protein-protein interaction network observed between top regulators found from VMRF in GDSC dataset SC1. Disconnected nodes are hidden.


drug sensitivity prediction for the combined set of 24 drugs in the form of correlation coefficients (Table F). Results for CCLE Dataset drug sensitivity prediction for the combined set of 24 drugs in the form of Normalized Root Mean Square Error (Table G). Comparison of α for different sets of synthetic data with and without noise added to the drug response (Table H). Comparison of α for random subsets of different sizes drawn from the original samples of a specific synthetic dataset; the original number of samples was 350 in this example (Table I). Simulation time for different drug sets in GDSC data. The reported simulation times are the time needed to generate the complete result for all drugs in a drug set for 5 fold cross validation (Table J). Simulation time for different drug sets in GDSC data. The reported simulation times are the time needed to generate the complete result for all drugs in a drug set for the 30–70 case (Table K). Simulation time for different methods for all drugs of the GDSC dataset (140) and the CCLE dataset (24). The reported simulation times are the time (in seconds) needed to generate the complete result for all drugs for the 30–70 case (Table L).

(PDF)

Author Contributions

Conceived and designed the experiments: SH RP. Performed the experiments: SH RR. Analyzed the data: SH RR SG RP. Contributed reagents/materials/analysis tools: SH RR SG RP. Wrote the paper: SH RR RP.

References

1. Sos ML, et al. Predicting drug susceptibility of non-small cell lung cancers based on genetic lesions. The Journal of clinical investigation. 2009; 119(6):1727–1740. doi:10.1172/JCI37127 PMID:19451690

2. Staunton JE, et al. Chemosensitivity prediction by transcriptional profiling. Proceedings of the National Academy of Sciences. 2001; 98:10787–10792. doi:10.1073/pnas.191368598

3. Lee JK, et al. A strategy for predicting the chemosensitivity of human cancers and its application to drug discovery. Proceedings of the National Academy of Sciences. 2007 Aug; 104(32):13086–13091. doi:10.1073/pnas.0610292104

4. Mitsos A, et al. Identifying Drug Effects via Pathway Alterations using an Integer Linear Programming Optimization Formulation on Phosphoproteomic Data. PLoS Comput Biol. 2009; 5(12). doi:10.1371/journal.pcbi.1000591 PMID:19997482

5. Walther Z, Sklar J. Molecular tumor profiling for prediction of response to anticancer therapies. Cancer J. 2011; 17(2):71–9. PMID:21427550

6. Zou H, Hastie T. Regularization and variable selection via the Elastic Net. Journal of the Royal Statistical Society, Series B. 2005; 67:301–320. doi:10.1111/j.1467-9868.2005.00527.x

7. Barretina J, et al. The Cancer Cell Line Encyclopedia enables predictive modelling of anticancer drug sensitivity. Nature. 2012 Mar; 483(7391):603–607. Available from: http://dx.doi.org/10.1038/nature11003. doi:10.1038/nature11003 PMID:22460905

8. Riddick G, et al. Predicting in vitro drug sensitivity using Random Forests. Bioinformatics. 2011 Jan; 27(2):220–224. doi:10.1093/bioinformatics/btq628 PMID:21134890

9. Costello JC, et al. A community effort to assess and improve drug sensitivity prediction algorithms. Nature Biotechnology. 2014. doi:10.1038/nbt.2877

10. Wan Q, Pal R. An ensemble based top performing approach for NCI-DREAM drug sensitivity prediction challenge. PLOS One. 2014; 9(6):e101183. doi:10.1371/journal.pone.0101183 PMID:24978814

11. Gonen M, Margolin AA. Drug susceptibility prediction against a panel of drugs using kernelized Bayesian multitask learning. Bioinformatics. 2014; 30(17):i556–i563. doi:10.1093/bioinformatics/btu464 PMID:25161247

12. Menden MP, Iorio F, Garnett M, McDermott U, Benes CH, Ballester PJ, et al. Machine Learning Prediction of Cancer Cell Sensitivity to Drugs Based on Genomic and Chemical Properties. PLoS One. 2013; 8(4). doi:10.1371/journal.pone.0061318 PMID:23646105

13. Segal M, Xiao Y. Multivariate Random Forests. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery. 2011; 1:80–87.

14. Wan Q, Pal R. A multivariate random forest based framework for drug sensitivity prediction. In: International Workshop on Genomic Signal Processing and Statistics (GENSIPS); 2013. p. 53.


15. Breiman L. Random Forests. Machine learning. 2001; 45:5–32. doi:10.1023/A:1010933404324

16. Meinshausen N. Quantile Regression Forests. Journal of Machine Learning Research. 2006; 7:983–999.

17. Biau G. Analysis of a random forests model. The Journal of Machine Learning Research. 2012; 13:1063–1095.

18. Mahalanobis PC. On the generalised distance in statistics. Proceedings of the National Institute of Sciences of India. 1936; 2(1):49–55.

19. Sim KC, Gales M. Precision matrix modelling for large vocabulary continuous speech recognition. Cambridge University Engineering Department. June 2004; Appendix B1.

20. Sklar A. Fonctions de répartition à n dimensions et leurs marges. Publ Inst Statist Univ Paris. 1959; 8:229–231.

21. Clayton DG. A model for association in bivariate life tables and its application in epidemiological studies of familial tendency in chronic disease incidence. International Statistics Reviews. 1978; 65:141–151.

22. Lee L. Generalized Econometric Models with Selectivity. Econometrica. 1983; 51:507–512. doi:10.2307/1912003

23. Frank MJ. On the simultaneous associativity of F(x,y) and x+y−F(x,y). Aequationes Math. 1979; 19:194–226. doi:10.1007/BF02189866

24. Demarta S, McNeil AJ. The t copula and related copulas. International Statistical Review. 2005; 73:111–129. doi:10.1111/j.1751-5823.2005.tb00254.x

25. Gumbel EJ. Distributions des Valeurs Extremes en Plusieurs Dimensions. Publications de l'Institute de Statistique de l'Université de Paris. 1960; 9:171–173.

26. Owzar K, Sen PK. Copulas: concepts and novel applications. International Journal of Statistics. 2003; LXI(3):323–353.

27. Archer KJ, Kimes RV. Empirical characterization of random forest variable importance measures. Computational Statistics & Data Analysis. 2008; 52(4):2249–2260. doi:10.1016/j.csda.2007.08.015

28. Robnik-Sikonja M, Kononenko I. Theoretical and empirical analysis of ReliefF and RReliefF. Machine Learning. 2003; 53:23–69. doi:10.1023/A:1025667309714

29. Yang W, et al. Genomics of Drug Sensitivity in Cancer (GDSC): a resource for therapeutic biomarker discovery in cancer cells. Nucleic acids research. 2013; 41(D1):D955–D961. doi:10.1093/nar/gks1111 PMID:23180760

30. Ling YH, Li T, Yuan Z, Haigentz M, Weber TK, Perez-Soler R. Erlotinib, an effective epidermal growth factor receptor tyrosine kinase inhibitor, induces p27KIP1 up-regulation and nuclear translocation in association with cell growth inhibition and G1/S phase arrest in human non-small-cell lung cancer cell lines. Molecular pharmacology. 2007; 72(2):248–258. doi:10.1124/mol.107.034827 PMID:17456787

31. Johnston SR, Leary A. Lapatinib: a novel EGFR/HER2 tyrosine kinase inhibitor for cancer. Drugs Today (Barc). 2006; 42(7):441–453. doi:10.1358/dot.2006.42.7.985637

32. SuperTarget. http://bioinf-apache.charite.de/supertarget_v2/index.php?site=home.

33. Falchook GS, Lewis KD, Infante JR, Gordon MS, Vogelzang NJ, DeMarini DJ, et al. Activity of the oral MEK inhibitor trametinib in patients with advanced melanoma: a phase 1 dose-escalation trial. The lancet oncology; 13(8):782–789. doi:10.1016/S1470-2045(12)70269-3 PMID:22805292

34. Ciuffreda L, Del Bufalo D, Desideri M, Di Sanza C, Stoppacciaro A, Ricciardi MR, et al. Growth-inhibitory and antiangiogenic activity of the MEK inhibitor PD0325901 in malignant melanoma with or without BRAF mutations. Neoplasia (New York, NY). 2009; 11(8):720. doi:10.1593/neo.09398

35. Broad-Novartis Cancer Cell Line Encyclopedia. http://www.broadinstitute.org/ccle/home. Genetic and pharmacologic characterization of a large panel of human cancer cell lines.

36. Tanizaki J, Okamoto I, Okamoto K, Takezawa K, Kuwata K, Yamaguchi H, et al. MET tyrosine kinase inhibitor crizotinib (PF-02341066) shows differential antitumor effects in non-small cell lung cancer according to MET alterations. Journal of Thoracic Oncology. 2011; 6(10):1624–1631. PMID:21716144

37. Ma PC, Schaefer E, Christensen JG, Salgia R. A selective small molecule c-MET Inhibitor, PHA665752, cooperates with rapamycin. Clinical cancer research. 2005; 11(6):2312–2319. doi:10.1158/1078-0432.CCR-04-1708 PMID:15788682

38. Morabito A, Piccirillo MC, Falasconi F, De Feo G, Del Giudice A, Bryce J, et al. Vandetanib (ZD6474), a dual inhibitor of vascular endothelial growth factor receptor (VEGFR) and epidermal growth factor receptor (EGFR) tyrosine kinases: current status and future directions. The oncologist. 2009; 14(4):378–390. doi:10.1634/theoncologist.2008-0261 PMID:19349511

39. Larsen AB, Stockhausen MT, Poulsen HS. Cell adhesion and EGFR activation regulate EphA2 expression in cancer. Cellular signalling. 2010; 22(4):636–644. doi:10.1016/j.cellsig.2009.11.018 PMID:19948216