
3. Methodology and Workflow

3.1. Datasets

3.1.1. ABC Efflux Dataset

A dataset of 1493 compounds was compiled from the substrate data available in the Metrabase database (Mak et al., 2015) (accessed in October 2014) for six ABC transporters: BCRP1, MDR1, MRP1, MRP2, MRP3 and MRP4. All instances were divided into two classes: substrates and non-substrates. The collection of SMILES provided was checked for repetitions and isomers using ACD/Labs, and mixtures were removed. Repetitions were merged and, in cases of conflicting information, the principle of minimum evidence was applied: any compound with at least one report of substrate activity was regarded as a potential substrate and was therefore classified as a substrate. This is a valid approach considering that all the initial data collected from Metrabase were selected based on quality standards (Mak et al., 2015).
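The merging rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tuple layout, the function name `merge_reports` and the toy SMILES are hypothetical and do not come from Metrabase or from the curation scripts used in this work.

```python
from collections import defaultdict


def merge_reports(reports):
    """Collapse repeated (compound, transporter) reports into one label.

    `reports` is an iterable of (smiles, transporter, is_substrate) tuples
    (a hypothetical layout). A pair is labelled a substrate as soon as at
    least one report says so; otherwise it stays a non-substrate.
    """
    merged = defaultdict(bool)  # missing keys start as False (non-substrate)
    for smiles, transporter, is_substrate in reports:
        key = (smiles, transporter)
        merged[key] = merged[key] or bool(is_substrate)
    return dict(merged)


# Toy data, for illustration only.
reports = [
    ("CCO", "MDR1", False),
    ("CCO", "MDR1", True),   # conflicting reports: resolved as substrate
    ("CCN", "MDR1", False),
    ("CCN", "MDR1", False),  # agreeing duplicates: merged into one entry
    ("CCC", "MDR1", True),
]
labels = merge_reports(reports)
```

Because the rule is a logical OR over all reports, the outcome does not depend on the order in which the reports are read.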

The resulting dataset of 1493 compounds showed a negligible imbalance in class label distribution for the larger transporter classes, i.e. BCRP1, MDR1, MRP1 and MRP2, with substrate to non-substrate ratios of 1.7, 1.3, 1.0 and 1.2, respectively. However, for the smaller transporter classes, namely MRP3 and MRP4, the ratio was around 2.5, which left an insufficient number of non-substrates for modelling and validation. Therefore, these two transporters were eliminated and the remaining four were investigated, using a final dataset of 1462 compounds spread across transporters as shown in Figure 3.1.

Figure 3.1. Schematic summary of transporter overlap represented in the Venn diagram. Below each transporter label are the total number of instances (in a square) in the full dataset, and the corresponding number of substrates and non-substrates. S: substrates, NS: non-substrates.

3.1.2. SLC Uptake Dataset

Substrate and non-substrate data were retrieved from Metrabase (Mak et al., 2015) (accessed in November 2015), and the binding profiles of all available SLC transporters were collated. These corresponded to OATP1A2, OATP1B1, OATP2B1, OATP1B3, OCT1 and PEPT1. Even though data on the Apical Sodium Dependent Bile Acid Transporter (ASBT) were also available for analysis, this transporter was later removed because there were too few data to allow acceptable external model testing.

Molecular descriptors were annotated using the structures provided as SMILES codes, which were analysed and validated using ACD/Labs. All duplicates and pairs of isomers were checked using ChemSketch. Duplicated entries were merged when their responses agreed; when contradicting responses were found (4.3% of the observations), the respective entry was annotated as a substrate, on the principle that a compound with at least one substrate report is likely to be a substrate, the same reasoning applied to the ABC Efflux Dataset (Section 3.1.1). As a result, substrate/non-substrate data for a total of six transporters were used for QSAR modelling, where each transporter corresponds to a class label in the multi-label classification task carried out in this work.

The dataset had a total of 760 unique compounds spread across 980 instance-label cases distributed as shown in Table 3.1. The full compound vs labels matrix was 21.5% filled in and had a label cardinality of 1.3 (i.e., on average, there are 1.3 labels per instance).

Table 3.1. Distribution of substrates (S) and non-substrates (NS) across the different transporters in the SLC dataset.

Transporter    S      NS
OATP1A2        55     24
OATP1B1        95     37
OATP1B3        58     26
OATP2B1        47     64
OCT1           159    88
PEPT1          246    81
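As a quick consistency check, the matrix fill and label cardinality quoted above can be recomputed from the counts in Table 3.1. The sketch below takes a "label" to mean any annotated compound-transporter pair (substrate or non-substrate), which is the reading that reproduces the reported figures; the variable names are illustrative.

```python
# Substrate (S) and non-substrate (NS) counts per transporter, from Table 3.1.
table_3_1 = {
    "OATP1A2": (55, 24),
    "OATP1B1": (95, 37),
    "OATP1B3": (58, 26),
    "OATP2B1": (47, 64),
    "OCT1": (159, 88),
    "PEPT1": (246, 81),
}
n_compounds = 760  # unique compounds, as reported in the text

# Every S or NS entry is one annotated instance-label case.
n_pairs = sum(s + ns for s, ns in table_3_1.values())
n_labels = len(table_3_1)

# Fraction of the compound x transporter matrix that carries an annotation.
fill = n_pairs / (n_compounds * n_labels)

# Average number of annotated transporters per compound.
cardinality = n_pairs / n_compounds

print(n_pairs, round(100 * fill, 1), round(cardinality, 1))  # 980 21.5 1.3
```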

3.1.3. Volume of Distribution Dataset

The Vd dataset compiled by Obach et al. (Obach et al., 2008) was used in the QSAR modelling. This dataset is composed of Vd measurements obtained exclusively from human intravenous administration, at steady state. The SMILES codes for the compounds in the dataset were retrieved from the provided names using the pubchempy module in Python. Retrieved SMILES were checked against a separate retrieval, in which a CAS-to-CID search on DrugBank was followed by CID-to-SMILES conversion using the online tool available in PubChem. The few mismatching cases (N=9) found when comparing the two sources of SMILES were resolved through manual retrieval.
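A minimal sketch of this cross-check is given below. It is not the script used in this work: `fetch_smiles` and `find_mismatches` are hypothetical helper names, the name lookup needs network access and the pubchempy package, and the SMILES attribute used (`isomeric_smiles`) may be named differently in other pubchempy versions. The comparison itself is a plain string match, so it can flag two different spellings of the same molecule; such cases are exactly what a manual review would then clear.

```python
def fetch_smiles(name):
    """Look up a compound name in PubChem (needs network access and pubchempy)."""
    import pubchempy as pcp  # imported lazily so the comparison runs offline

    hits = pcp.get_compounds(name, "name")
    return hits[0].isomeric_smiles if hits else None


def find_mismatches(primary, secondary):
    """Return the names whose SMILES differ, or are missing, between two sources.

    Both arguments are {name: smiles} dictionaries (a hypothetical layout).
    """
    flagged = []
    for name, smiles in primary.items():
        other = secondary.get(name)
        if other is None or other != smiles:
            flagged.append(name)
    return sorted(flagged)


# Toy data with placeholder names, for illustration only.
primary = {"drug_a": "CCO", "drug_b": "CC(=O)O", "drug_c": "c1ccccc1"}
secondary = {"drug_a": "CCO", "drug_b": "CC(O)=O"}

# drug_b differs as a string (same molecule, different spelling) and
# drug_c is absent from the second source, so both go to manual review.
to_review = find_mismatches(primary, secondary)
```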

The final list of SMILES codes was then standardised using the MolVS Python package, which entailed standardisation of chemotypes and tautomers. Comparing canonical SMILES identified pairs of repeated 2D structures. Pairs of isomers were kept (given that 3D descriptors are used); for pairs consisting of a canonical and an isomeric structure, only one instance was kept (chosen based on the availability of observations for the physiological variables included in the dataset; see below). The final dataset was composed of 665 compounds.

Given the physiological implication of protein-mediated transport, phospholipidosis (PL) and plasma protein binding (PPB) in the distribution of numerous compounds, these were added to the dataset as physiological descriptors (PDs), used alongside a set of molecular descriptors (MDs). Specifically, twelve PD variables were added: drug-induced PL, ABC transport (mediated by P-gp, BCRP, MRP1 and MRP2), SLC transport (mediated by PEPT1, OCT1, OATP1B1, OATP1B3, OATP1A2 and OATP2B1) and PPB.

Transporter binding data were retrieved from the ABC and SLC datasets described in Sections 3.1.1 and 3.1.2, and compounds in the Vd dataset were annotated with a binary response. The same was done for drug-induced PL data, which were retrieved from different sources available in the literature (Lowe et al., 2012, Goracci et al., 2013, Orogo et al., 2012, Bauch et al., 2015, Muehlbacher et al., 2012) as well as from ChEMBL (removing observations repeated from the previous references). Of these, Goracci et al. (Goracci et al., 2013) and Lowe et al. (Lowe et al., 2012) were regarded as the gold standards, as their data were obtained from electron microscopy measurements (the highest quality source for phospholipidosis data). The remaining sources are herein termed “secondary”. Under this criterion, whenever the full set of measurements for a given compound shows conflicting responses, the responses from those two sources are kept and the remaining conflicting data are ignored. When neither of these two sources provides information, information from any of the other sources is accepted, provided that, where two or more observations exist for the same compound, they agree. For 5 instances there was no agreement between multiple secondary sources, so their PL observations were discarded.
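The source-priority rule for PL annotations can be written as a small decision function. The sketch below is one reading of the rule, not the code used in this work: the source tags and the name `resolve_pl` are hypothetical, and the case where the two gold-standard sources contradict each other is not covered by the text, so the function simply leaves it unresolved.

```python
# Hypothetical tags for the two gold-standard (electron microscopy) sources.
GOLD_SOURCES = {"Goracci2013", "Lowe2012"}


def resolve_pl(observations):
    """Resolve PL reports for one compound into True, False or None.

    `observations` is a list of (source, induces_pl) tuples. None means the
    compound is left without an experimental PL annotation.
    """
    if not observations:
        return None

    gold = {label for source, label in observations if source in GOLD_SOURCES}
    if len(gold) == 1:
        return gold.pop()  # gold standard wins; secondary data are ignored
    if len(gold) == 2:
        return None  # assumption: gold sources disagreeing is left unresolved

    # No gold-standard data: secondary sources are accepted only if they agree.
    secondary = {label for _, label in observations}
    return secondary.pop() if len(secondary) == 1 else None
```

For example, `resolve_pl([("Goracci2013", True), ("ChEMBL", False)])` keeps the gold-standard response, whereas two disagreeing secondary reports return `None` and the observation is discarded.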

Whenever applicable, phospholipidosis or transporter information associated with a given isomer is assigned exclusively to the corresponding isomer entry in the Vd dataset. Otherwise, it is assigned to the corresponding non-isomeric equivalent entry.

As the number of entries in the Vss dataset for which an experimental response for transport or PL could be retrieved was limited, the missing responses were completed with predictions. Each of the 10 transporter variables was filled in with the output produced by two multi-label models (chapters 4 and 5) applied in a preprocessing step. The same was done for PL, where predictions were obtained from a model previously trained on the benchmark PL dataset curated by Goracci. This model was trained prior to initiating the work in this chapter, using physicochemical descriptors (obtained as explained in section 3.2) and random forest classification paired with prior greedy search feature selection, implemented in WEKA (Hall et al., 2009) under default conditions. These conditions were selected after preliminary optimization, based on the highest performance in an internal 10-fold cross validation on the training set (not using the testing set). Given that the dataset used is unbalanced, a cost of 2 was assigned to false negatives (while false positives kept the default cost of 1) during training. The model was built on 80% of the dataset and tested on the remaining 20%, showing high sensitivity (0.857) and specificity (0.711) on the test set.
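For reference, the evaluation quantities mentioned above can be computed as in the following sketch. It illustrates the definitions only, not the WEKA implementation: the function names and the toy labels are hypothetical, and the total misclassification cost is shown simply to make the 2:1 weighting of false negatives against false positives concrete.

```python
def confusion_counts(y_true, y_pred):
    """Count (tp, fn, fp, tn) for binary labels, with 1 = PL inducer."""
    tp = fn = fp = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth and pred:
            tp += 1
        elif truth and not pred:
            fn += 1
        elif not truth and pred:
            fp += 1
        else:
            tn += 1
    return tp, fn, fp, tn


def sensitivity(tp, fn):
    """Fraction of true positives that were recovered."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Fraction of true negatives that were recovered."""
    return tn / (tn + fp)


def misclassification_cost(fn, fp, fn_cost=2.0, fp_cost=1.0):
    """Total cost when a false negative counts twice as much as a false positive."""
    return fn * fn_cost + fp * fp_cost


# Toy labels, for illustration only.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 1]
tp, fn, fp, tn = confusion_counts(y_true, y_pred)
```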

To allow differentiation of the transporter predictions and the PL predictions according to their quality, they were used in the form of class probabilities (rather than categorical class predictions). A schematic representation of this process can be found in Figure 3.2. Here, for each transporter, experimental observation is annotated as substrate (1) or non-substrate (2), shown in black. All missing experimental observations (e.g. compound #1 for BCRP1) are completed with the prediction probabilities drawn from the substrate class of each transporter’s classification model (i.e. probabilities up to 0.5 represent a predicted non-substrate, probabilities above 0.5 represent a predicted substrate), shown in blue.

Figure 3.2. Completion of missing transporter binding data with the prediction probabilities obtained
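The completion step can be sketched as follows for a single transporter. This is an illustration under stated assumptions rather than the preprocessing code used in this work: the dictionary layout and function names are hypothetical, and non-substrates are encoded as 0 here so that experimental labels sit on the same [0, 1] scale as the predicted probabilities.

```python
def complete_with_predictions(observed, predicted_prob):
    """Fill missing experimental labels with predicted substrate probabilities.

    `observed` maps compound -> 1 (substrate), 0 (non-substrate) or None
    (no experimental data). `predicted_prob` maps compound -> probability
    of the substrate class from the transporter's classification model.
    """
    completed = {}
    for compound, prob in predicted_prob.items():
        label = observed.get(compound)
        # Experimental data take priority; predictions only fill the gaps.
        completed[compound] = float(label) if label is not None else prob
    return completed


def predicted_class(prob):
    """Probabilities above 0.5 are read as substrate, up to 0.5 as non-substrate."""
    return "substrate" if prob > 0.5 else "non-substrate"


# Toy data, for illustration only.
observed = {"cpd1": None, "cpd2": 1, "cpd3": 0}
predicted = {"cpd1": 0.82, "cpd2": 0.40, "cpd3": 0.65}
completed = complete_with_predictions(observed, predicted)
```

Keeping the probability rather than the thresholded class lets a downstream model weigh a confident prediction (e.g. 0.82) differently from a borderline one.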

Lastly, the Plasma Protein Binding (PPB) predictor was subjected to the same procedure of completing missing observations with predicted data. For this reason, a prior step of PPB modelling had to be carried out. In order to maximize predictive power, the PPB data provided by Obach et al. were not used and, instead, a larger dataset was used. The PPB dataset deposited in ChEMBL by AstraZeneca (assay reference CHEMBL3301361), composed of 1614 compounds, was used as the single source of data. No other data source was added to this dataset, as it was deemed sufficiently large (relative to the scale of the Vss dataset) and keeping to a single source reduces the noise that results from inter-laboratory experimentation. The data were modelled using physicochemical descriptors (obtained as explained in section 3.2) and a random forest model (of 200 trees, optimized by 10-fold CV) paired with greedy search pre-processing feature selection. Eighty percent of the data was used for training, and the resulting model yielded a 7.9% mean absolute error in the test set; the PPB predictions obtained were used to fill in missing data.
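The hold-out evaluation can be illustrated with the sketch below. The text does not state how the 80/20 partition was drawn, so the seeded random shuffle used here is an assumption, as are the function names and the toy PPB values; only the error metric follows directly from the description.

```python
import random


def train_test_split(items, train_fraction=0.8, seed=0):
    """Randomly partition items into training and test subsets (assumed scheme)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between measured and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Partition sizes for a dataset of 1614 compounds.
train_ids, test_ids = train_test_split(range(1614))

# Toy PPB values (% bound), for illustration only.
measured = [90.0, 50.0, 99.0]
predicted = [85.0, 60.0, 96.0]
mae = mean_absolute_error(measured, predicted)
```

Since PPB is expressed as percentage bound, the mean absolute error is in percentage points.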

3.2. Molecular Descriptors

All molecular descriptors used as input variables throughout this work were calculated using ACD/Labs logD and Molecular Operating Environment (MOE 2013 in chapters 4, 5 and 8; MOE 2015 in chapters 6 and 7). Prior to any calculation, input structures obtained in the form of SMILES codes were washed and standardized. As a portion of the descriptors calculated in MOE depend on the 3D conformation of a compound, all structures were submitted to a minimization protocol beforehand. An initial molecular mechanics minimization was performed (further information on the method used is provided in the Methods section of each experimental chapter), followed by refinement with quantum mechanics minimization (using the PM6 method).

No single charge-assignment method was selected over any other, across homologous descriptors, as it has been shown that different charge assignment methods have led to variable success in modelling different datasets in the past (Mittal et al., 2009). This allows a data-driven selection of charge-related molecular descriptors using PEOE vs PM6 methods, as well as various descriptors derived from semi-empirical methods, AM1, PM3 and MNDO.

All invariant or mainly empty descriptors were excluded, as well as repeated and spatial coordinates-dependent descriptors. Descriptors with predictions of activity/response endpoints (such as mutagenicity) were also excluded. pKa and pKb values, calculated for the most acidic and basic species, were used to calculate the ionized fraction of acid, base and zwitterion at pH 7.4, as well as the unionized fraction. After this, pKa and pKb were excluded.
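One common way to obtain such fractions is through the Henderson-Hasselbalch relationship, sketched below. This is an assumption about the calculation, not a description of the exact procedure used in this work: it treats the most acidic and most basic groups as ionizing independently, takes the basic group's strength as the pKa of its conjugate acid (converted from pKb as 14 - pKb, which holds in water at 25 °C), and uses hypothetical function names.

```python
def pkb_to_conjugate_pka(pkb):
    """Convert pKb to the pKa of the conjugate acid (water, 25 degrees C)."""
    return 14.0 - pkb


def ionization_fractions(pka_acid, pka_base, ph=7.4):
    """Fractions of anion, cation, zwitterion and neutral species at a given pH.

    `pka_acid` is the pKa of the most acidic group; `pka_base` is the pKa of
    the conjugate acid of the most basic group. The two groups are assumed to
    ionize independently, so the four fractions sum to 1.
    """
    # Henderson-Hasselbalch: fraction of the acidic group that is deprotonated.
    f_acid = 1.0 / (1.0 + 10.0 ** (pka_acid - ph))
    # Fraction of the basic group that is protonated.
    f_base = 1.0 / (1.0 + 10.0 ** (ph - pka_base))
    return {
        "anion": f_acid * (1.0 - f_base),
        "cation": (1.0 - f_acid) * f_base,
        "zwitterion": f_acid * f_base,
        "neutral": (1.0 - f_acid) * (1.0 - f_base),
    }


# A group whose pKa equals the pH is exactly half ionized.
fractions = ionization_fractions(pka_acid=7.4, pka_base=7.4)
```

Expressing ionization as fractions at physiological pH gives the models bounded inputs between 0 and 1, which is consistent with dropping the raw pKa and pKb values afterwards.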

In document Machine Learning for Modelling Tissue Distribution of Drugs and the Impact of Transporters (Page 75-80)