5. Using Multi-label Classification to Explore the Link among the Solute Carriers (SLCs)
5.2.2. Multi-label QSAR modelling
As the purpose of this work was to investigate the relationships between transporters, a CC technique was selected as the learning scheme because it can account for potential correlation between transporters (called label interaction in the multi-label machine learning context). If a given set of compounds has several measured response variables, e.g. interaction (or lack thereof) with several transporters (each called a label within the data mining context), the labels may be correlated, and if correlations are indeed present, exploiting them reduces the complexity of the learning task (Gibaja and Ventura, 2014).
In this work, compounds are classed as substrate or non-substrate of each of the six transporters (labels), and the SLC dataset was used to train models in a feedforward chain sequence, implemented as follows. A classifier for label #1 (the first transporter) is trained using molecular descriptors as the model features and feeds its prediction set (the predicted class for each compound) to the classifier for label #2 (the second transporter in the chain), which is, in turn, trained using the label #1 predictions alongside the molecular descriptors. The classifiers for label #1 and label #2 then feed their class predictions forward to the classifier for label #3, and so on. Note that each label model, i.e. a model for a single transporter, within a multi-label scheme is called a single-label model. In summary, at any point in the chain, all available prior predictions are used as features alongside the molecular descriptors, so the classifier for label #6 uses the predictions for labels #1 through #5 as additional features to complement the molecular descriptors. Predicted labels used as descriptors are generically termed “pLabel(s)”, and specific pLabels are named by prefixing the label in question with a “p” (e.g. pMDR1).
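The chaining step described above can be illustrated with a minimal Python sketch. It is not the WEKA workflow used in this work: the 1-nearest-neighbour base learner, the function names (`fit_chain`, `predict_chain`) and the four toy compounds are assumptions made purely for illustration. What it shows is how each classifier's predictions (the pLabels) are appended to the molecular descriptors before the next classifier in the chain is trained.

```python
from typing import Dict, List, Sequence

Vector = List[float]


class OneNearestNeighbour:
    """Stand-in base learner (assumption; the study used WEKA tree-based classifiers)."""

    def fit(self, X: Sequence[Vector], y: Sequence[int]) -> "OneNearestNeighbour":
        self.X, self.y = [list(row) for row in X], list(y)
        return self

    def predict(self, X: Sequence[Vector]) -> List[int]:
        def nearest(row: Vector) -> int:
            dists = [sum((a - b) ** 2 for a, b in zip(row, ref)) for ref in self.X]
            return self.y[dists.index(min(dists))]
        return [nearest(row) for row in X]


def fit_chain(X, Y: Dict[str, List[int]], order: List[str], make_learner=OneNearestNeighbour):
    """Train one model per label in `order`; later models also see earlier pLabels."""
    models = {}
    augmented = [list(row) for row in X]           # descriptors, extended step by step
    for label in order:
        models[label] = make_learner().fit(augmented, Y[label])
        p_label = models[label].predict(augmented)  # predicted classes, not measured ones
        augmented = [row + [p] for row, p in zip(augmented, p_label)]
    return models


def predict_chain(models, X, order: List[str]) -> Dict[str, List[int]]:
    """Predict labels in chain order, feeding each prediction forward as a feature."""
    augmented = [list(row) for row in X]
    predictions = {}
    for label in order:
        predictions[label] = models[label].predict(augmented)
        augmented = [row + [p] for row, p in zip(augmented, predictions[label])]
    return predictions


# Toy data (invented): two descriptors, two labels, four compounds.
X_train = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
Y_train = {"MDR1": [0, 0, 1, 1], "PEPT1": [0, 1, 1, 0]}
order = ["MDR1", "PEPT1"]                           # pMDR1 becomes a feature for PEPT1

models = fit_chain(X_train, Y_train, order)
preds = predict_chain(models, [[0.9, 0.1]], order)
print(preds)  # {'MDR1': [1], 'PEPT1': [1]}
```

Changing `order` changes which pLabels each model can see, which is why the chain ordering itself has to be explored.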
To allow full exploration of all types of interactions between labels, an exhaustive search of all possible chain sequences was carried out, as shown in Scheme 5.1. This means that all label permutations (orderings) of the 6-label chain are tested and, for each possible combination of labels forming a shorter chain, all permutations of that combination are also tested. As this problem is focused on addressing all 6 transporters, shorter chains are completed, by default, with standalone single-label models, as demonstrated in Scheme 5.1. Note that the standalone single-label models are taken from an alternative BR model, which is built as the baseline comparator without label interaction for the CC model. In summary, the questions addressed in this study are two-fold: Is there any benefit from accounting for transporter overlap? If so, which transporters’ substrate prediction benefits from information from other transporters?
Scheme 5.1. Schematic representation of the exploration space of possible label (transporter) combinations. Note that all but the last line in the scheme represent different formats of the CC model of varying lengths, and the last line represents the BR alternative model.
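The size of this exploration space can be checked with a short enumeration. The sketch below is illustrative only: the label names `T1` to `T6` are placeholders for the six transporters, and it assumes that the shortest chain contains two labels (a one-label chain being indistinguishable from a standalone BR model). Each configuration pairs an ordered chain with the labels left to standalone single-label models.

```python
from itertools import combinations, permutations

LABELS = ["T1", "T2", "T3", "T4", "T5", "T6"]  # placeholder names for the six transporters


def chain_configurations(labels, min_len=2):
    """Yield (chain, standalone) pairs covering every chain ordering of every label subset."""
    for size in range(min_len, len(labels) + 1):
        for subset in combinations(labels, size):
            # Labels outside the chain are handled by standalone (BR) models.
            standalone = tuple(l for l in labels if l not in subset)
            for chain in permutations(subset):
                yield chain, standalone


configs = list(chain_configurations(LABELS))
n_full = sum(1 for chain, _ in configs if len(chain) == len(LABELS))
print(len(configs), n_full)  # 1950 720
```

Under these assumptions there are 720 orderings of the full 6-label chain and 1950 chain configurations in total, in addition to the single BR baseline.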
In order to maximize predictive accuracy, prior to building the label combinations represented in Scheme 5.1, each label’s modelling conditions (classifier algorithm and feature set) were optimized and a selected single-label model was established for each transporter. To this end, each label was modelled with each of the five available feature sets generated by the five feature selection methods, and the best classifier-feature set pairs were selected based on performance on the validation set. This was done for three different classifiers available in WEKA: C4.5 (J48), RF (RandomForest) and boosted C4.5 trees (MultiBoostAB + J48), which were tuned using 10-fold cross-validation. For the C4.5 models, the pruning was tuned as per section 3.4. For the RF, the number of trees was optimized (ranging between 2 and 1000 trees, with a step of 50 trees). For the boosted trees (BT), the conditions for the embedded C4.5 trees were inherited from the previously optimized C4.5 models, the number of committees (or iterations) was optimized over the range 10 to 100, and the number of subcommittees was set to the square root of the committee size, as recommended by the author (Webb, 2000). Additionally, whenever a classifier failed to generate a good model for a certain label (G-mean < 0.7, where G-mean is defined in section 3.5), the algorithm was re-run using the feature set that had previously generated the model with the highest G-mean, and a misclassification cost (optimized between 2 and 6) was applied to penalize mispredictions of the minority class; that is, the cost of misclassifying an instance of the minority class is multiplied by a number between 2 and 6, whilst the cost of misclassifying an instance of the majority class remains 1.
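The re-run criterion can be made concrete with a small sketch. It assumes the usual definition of G-mean as the geometric mean of sensitivity and specificity (the definition used in this work is the one given in section 3.5), and it assumes that the candidate misclassification costs are the integers 2 to 6. The function names and the confusion-matrix counts are invented for illustration.

```python
from math import sqrt


def g_mean(tp: int, fn: int, tn: int, fp: int) -> float:
    """Geometric mean of sensitivity and specificity (assumed definition)."""
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return sqrt(sensitivity * specificity)


def needs_cost_sensitive_rerun(score: float, threshold: float = 0.7) -> bool:
    """A model with G-mean below the threshold triggers a cost-sensitive re-run."""
    return score < threshold


def candidate_costs(min_cost: int = 2, max_cost: int = 6):
    """Cost settings to try: minority-class errors cost c, majority-class errors cost 1."""
    return [{"minority_error": c, "majority_error": 1}
            for c in range(min_cost, max_cost + 1)]  # integer steps are an assumption


# Invented counts: sensitivity 0.4, specificity 0.9.
weak = g_mean(tp=20, fn=30, tn=90, fp=10)
# Invented counts: sensitivity 0.6, specificity 0.9.
fair = g_mean(tp=30, fn=20, tn=90, fp=10)

print(needs_cost_sensitive_rerun(weak), needs_cost_sensitive_rerun(fair))  # True False
```

Because G-mean is a product of the two class-wise rates, a model that ignores the minority class scores close to zero however high its overall accuracy, which makes it a natural trigger for cost-sensitive training on imbalanced labels.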
Two labels (PEPT1 and OATP1B3) produced models with unacceptable performance (i.e. either sensitivity or specificity below 0.5 in the validation set) with all of the above classifiers. As a result, special methods were applied to them. Initially, the synthetic minority over-sampling technique (SMOTE) was applied and the best modelling conditions identified up to this point were re-run for each of the two transporters. This gave acceptable performance for OATP1B3, but not for PEPT1. To overcome this, PEPT1 was submitted to an under-over bagging (UOBag) procedure similar to one described in the literature (Galar et al., 2012), which led to acceptable performance. The procedure consisted of a series of 10 runs where, in each run, the dataset was submitted to SMOTE, which added 50 (100%) instances to the minority class, followed by removal of 47 (32%) instances from the majority class by undersampling (to reach two balanced classes), and then 80% random sampling of the total resulting data. The sampled subset was then used to build a C4.5 model (using parameters inherited from the prior C4.5 optimization), with a misclassification cost of 2 applied to each false positive prediction made during training. This generated an ensemble of ten C4.5 models that form the final UOBag model.
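The resampling stage of the UOBag procedure can be sketched as follows. This is a simplified illustration rather than the implementation used here: the SMOTE step is a bare-bones version using k = 5 nearest neighbours (an assumption), the 80% sampling is assumed to be without replacement, and the toy data are random numbers. The class sizes (50 minority, 147 majority) are inferred from the counts quoted above, and all function names are hypothetical. Training of the cost-sensitive C4.5 model on each sampled subset is not shown.

```python
import random
from typing import List, Tuple

Vector = List[float]


def smote_100(minority: List[Vector], k: int, rng: random.Random) -> List[Vector]:
    """One synthetic instance per minority instance (100% over-sampling).

    Each synthetic point lies on the segment between an instance and one of its
    k nearest minority neighbours. Assumes at least two minority instances.
    """
    synthetic = []
    for base in minority:
        others = [m for m in minority if m is not base]
        others.sort(key=lambda m: sum((a - b) ** 2 for a, b in zip(base, m)))
        mate = rng.choice(others[:k])
        gap = rng.random()
        synthetic.append([a + gap * (b - a) for a, b in zip(base, mate)])
    return synthetic


def uobag_samples(minority: List[Vector], majority: List[Vector], n_runs: int = 10,
                  sample_frac: float = 0.8, k: int = 5,
                  seed: int = 0) -> List[List[Tuple[Vector, str]]]:
    """Return one resampled training set per run (over-sample, under-sample, subsample)."""
    rng = random.Random(seed)
    runs = []
    for _ in range(n_runs):
        over = minority + smote_100(minority, k, rng)               # minority doubled
        under = rng.sample(majority, min(len(over), len(majority)))  # classes balanced
        pool = [(x, "minority") for x in over] + [(x, "majority") for x in under]
        runs.append(rng.sample(pool, round(sample_frac * len(pool))))
    return runs


# Toy descriptors (random numbers) with the inferred class sizes.
toy = random.Random(1)
minority = [[toy.gauss(1, 1), toy.gauss(1, 1)] for _ in range(50)]
majority = [[toy.gauss(-1, 1), toy.gauss(-1, 1)] for _ in range(147)]

runs = uobag_samples(minority, majority)
print(len(runs), len(runs[0]))  # 10 160
```

Each of the ten subsets would then be passed to the base learner, giving the ten-model ensemble; with the sizes above, every run draws 160 instances from a balanced pool of 200.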