Chapter 2 Methodology

2.2 Integration of Chemogenomics Data for Knowledgebase Development and Modeling

The number of compounds and activity records available at major publicly accessible portals such as PubChem [172] and ChEMBL [41, 42] has increased dramatically and is now in the order of millions. With limited or no access to commercial resources, much academic research relies on the data from these portals. However, many studies have reported inconsistencies and uncertainties in compound structure representation and heterogeneity in compound activity data [136, 180, 189, 190]. The choice of descriptors has a strong influence on the resulting QSAR models, and descriptors are computed from the chemical structures. Therefore, an erroneous representation of chemical structures can hamper the performance of the models [136]. It was also reported that the activity values of chemical compounds obtained from different laboratories frequently disagree [138]. Thus, establishing appropriate search criteria to mine this wealth of data and carefully integrating the data extracted from different resources are essential. To this end, the recommendations proposed in the literature [138, 191, 192] were followed.

2.2.1 Integration of Compound Data

While many major resources provide compounds in standard file formats such as SDF (with 2D or 3D coordinates), some provide simpler representations such as SMILES. Depending on the software, and sometimes the software version, used to generate them, minor discrepancies between representations of the same compound can be expected. Consequently, data obtained from different sources may contain duplicate entries. The following steps were applied consistently throughout this thesis to integrate compound data obtained from different resources.

(a) Curation of Chemical Structures

This step involves standardization of the chemical structures in the data set. The curation protocol typically starts with the removal of problematic chemical records such as inorganic compounds, mixtures of compounds, counterions, and biologics. Next, the structures are validated by correcting valence violations and standardizing the representation of aromaticity, tautomers, and charges. Finally, the structure of the compound is cleaned and a 3D representation is generated. Dealing with tautomers is challenging since the ratio of different tautomers is subjective [193]. A number of software tools facilitate these tasks. For instance, the JChem suite from ChemAxon [194] provides a standardizer, and the open-source cheminformatics toolkit RDKit [195] provides Structure Normalizer nodes in KNIME [196]. More detailed guidelines are provided by Tropsha et al. [138, 192]. In this thesis, the curation of chemical structures was performed using the InstantJChem software from ChemAxon, accessed with an academic license.
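As a rough illustration of the first curation step only, the following Python sketch flags records for removal at the text level, before any toolkit-based standardization. It is not the InstantJChem workflow used in this thesis. The function names, the simplified atom pattern, and the rule that a structure without carbon is treated as inorganic are all assumptions made for illustration; a dot-separated SMILES is taken to indicate a multi-component record (a mixture or a salt with its counterion).

```python
import re

# Assumption: atoms are recognized with a simplified SMILES pattern, either a
# bracket atom such as [Na+] or [13CH4], or an atom of the organic subset.
_ATOM = re.compile(
    r"\[\d*([A-Z][a-z]?|[a-z]{1,2})[^\]]*\]"   # bracket atom, optional isotope
    r"|(Cl|Br|[BCNOPSFI]|[bcnops])"            # organic subset, aromatic in lower case
)


def element_symbols(smiles):
    """Return the element symbols found in a SMILES string, in order."""
    return [m.group(1) or m.group(2) for m in _ATOM.finditer(smiles)]


def triage_smiles(smiles):
    """Classify a record as 'multi-component', 'no-carbon', or 'keep'."""
    if "." in smiles:
        # Disconnected fragments: mixture, or parent compound plus counterion.
        return "multi-component"
    if not any(sym in ("C", "c") for sym in element_symbols(smiles)):
        # Heuristic stand-in for "inorganic"; a real toolkit should decide this.
        return "no-carbon"
    return "keep"


examples = [
    "CC(=O)Oc1ccccc1C(=O)O",  # aspirin
    "[Na+].[Cl-]",            # sodium chloride, two components
    "O=[Ti]=O",               # titanium dioxide, no carbon
    "ClCCl",                  # dichloromethane, Cl must not be read as carbon
]
for smi in examples:
    print(smi, "->", triage_smiles(smi))
```

Flagged records would still need manual or toolkit-based inspection, for example to strip a counterion and keep the parent structure rather than discard the whole record.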

(b) Identification of Duplicates

It is often the case that the same compound is tested in different experiments and recorded multiple times in a bioactivity database [192]. For instance, the same compound available from different vendors might be tested in the same assay across multiple laboratories, resulting in multiple activity records identified by different internal identifiers [197]. This is also the case when collating drugs from different resources that use different internal identifiers. Detecting structurally identical compounds is the first step in dealing with such data. Many methods and freely accessible tools are available that identify duplicates based on different structural representations such as molecular descriptors, chemical names, SMILES, and database identifiers [192, 198, 199]. In this thesis, the hashed InChI notation (commonly referred to as the standard InChIKey) was employed owing to its wide acceptance as a standard chemical structure identifier [66]. Compounds standardized in the previous step are processed to generate InChIKeys, which are then checked for duplicates via string matching.

2.2.2 Integration of Compound Bioactivity Data

Large-scale treatment of bioactivity data is a much more difficult endeavor than the previous steps. Pharmaceutical companies often measure the activity of a compound in duplicate or triplicate in the same assay in order to assess the experimental variability of the assay using different statistical metrics [178, 192]. Since such data are often not available for academic research, alternative recommendations are needed that facilitate efficient mining of the bioactivity data in the public domain. Identification of duplicate compound entries is the starting point when treating compound bioactivity data. Therefore, the protocol starts with the two steps described before. The following steps are then followed in order to arrive at curated sets of bioactivity data.
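Since duplicate detection relies on exact string matching of InChIKeys, it can be sketched with a plain dictionary lookup. The example below is a minimal illustration under stated assumptions: the InChIKeys are presumed to have been generated beforehand from the standardized structures, and the record layout, the source identifiers, and the activity values are invented for the example (only the two InChIKeys, those of aspirin and caffeine, are real).

```python
from collections import defaultdict


def group_by_inchikey(records):
    """Map each InChIKey to the list of records that carry it."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec["inchikey"]].append(rec)
    return dict(groups)


def find_duplicates(records):
    """Return only the InChIKeys that occur in more than one record."""
    return {key: recs
            for key, recs in group_by_inchikey(records).items()
            if len(recs) > 1}


# Hypothetical records: identifiers and activity values are made up.
records = [
    {"source_id": "vendorA-001", "inchikey": "BSYNRYMUTXBXSQ-UHFFFAOYSA-N", "pIC50": 5.1},
    {"source_id": "vendorB-417", "inchikey": "BSYNRYMUTXBXSQ-UHFFFAOYSA-N", "pIC50": 5.4},
    {"source_id": "vendorA-002", "inchikey": "RYYVLZVUVIJVGH-UHFFFAOYSA-N", "pIC50": 4.2},
]

duplicates = find_duplicates(records)
for key, recs in duplicates.items():
    print(key, [r["source_id"] for r in recs])
```

Each group of duplicates collects the activity values reported for one compound, which is the input needed when compounds with multiple activity values are treated later in the protocol.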

(a) Search Criteria for High-confidence Bioactivity Data

It is acknowledged that large collections of compound bioactivity data are heterogeneous and are therefore associated with different experimental uncertainties and hence different levels of confidence [191]. For instance, the target annotations of drugs from the ChEMBL database varied considerably with different data selection criteria [200]. Therefore, the data selection criteria influence the conclusions drawn from such data. Practical recommendations were proposed by Bajorath et al. to select compound data sets with high confidence [191]. These criteria, outlined in Figure 2.1, have been adapted for this thesis.
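The selection criteria of Figure 2.1 can be expressed as a simple record filter. The sketch below is a minimal illustration, not the procedure used in this thesis: the dictionary keys are hypothetical and only loosely modeled on ChEMBL-style column names, the assay type codes "B" and "F" are assumed to denote binding and functional assays, and the final step of treating compounds with multiple activity values is left out.

```python
def is_high_confidence(rec):
    """Apply the record-level selection criteria to one activity record."""
    return (
        rec["target_organism"] == "Homo sapiens"      # human targets only
        and rec["confidence_score"] == 9              # assay confidence score = 9
        and rec["assay_type"] in {"B", "F"}           # binding or functional assay
        and rec["standard_type"] in {"IC50", "Ki"}    # activity type
        and rec["standard_units"] == "nM"             # nanomolar values only
        and rec["standard_relation"] == "="           # exact measurements only
    )


def make_record(record_id, **overrides):
    """Build a hypothetical record that passes all filters unless overridden."""
    rec = {
        "record_id": record_id,
        "target_organism": "Homo sapiens",
        "confidence_score": 9,
        "assay_type": "B",
        "standard_type": "Ki",
        "standard_units": "nM",
        "standard_relation": "=",
    }
    rec.update(overrides)
    return rec


# Invented records for illustration only.
records = [
    make_record("rec-1"),
    make_record("rec-2", standard_relation=">"),
    make_record("rec-3", target_organism="Mus musculus"),
    make_record("rec-4", confidence_score=8),
]

selected = [rec for rec in records if is_high_confidence(rec)]
print([rec["record_id"] for rec in selected])
```

Each additional condition can only shrink the selected set, which reflects the trade-off between data confidence and data set size.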

Figure 2.1: Compound selection criteria for generation of high-confidence bioactivity data. The criteria and figure are adapted from [191].

[Figure 2.1 content. Starting point: all records available in the bioactivity database. Filters: only records with interactions against human targets; only records with assay confidence score = 9; only records with assay type binding/functional assay; only records with activity type IC50 or Ki; only records with standard unit nanomolar (nM); only records with relation type “=”; treat compounds with multiple activity values. Result: high-confidence bioactivity data. Axis labels: Data Confidence, Data Set Size.]

(b) Detection of Activity Cliffs

The presence of pairs of compounds that share a high structural similarity but possess highly different bioactivity values is considered one of the challenges in the development of robust QSAR models [132]. Such pairs of compounds are referred to as activity cliffs (Figure 2.2) [135, 201]. Different similarity assessment strategies can be employed to identify activity cliffs; matched molecular pair (MMP) and fingerprint similarity-based approaches are the most common [201]. Identification and treatment of activity cliffs are recommended before a computational study is initiated [192]. Consideration of 3D structural differences depends on the availability of the 3D structure of the target and the binding mode of at least one compound from the pair forming an activity cliff. Therefore, in this thesis, only 2D activity cliffs were considered.

Figure 2.2: Exemplary activity cliffs found within the hERG bioactivity data set from the ChEMBL database. The activity cliffs are based on (a) matched molecular pair; (b) fingerprint similarity (ECFP4).
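The fingerprint similarity-based approach can be sketched as a pairwise scan that reports compound pairs with a high Tanimoto coefficient (Tc) and a large potency difference. The example below is a minimal illustration under stated assumptions: fingerprints are represented as sets of on-bit indices, the toy compounds and their values are invented, and the default thresholds (`min_similarity`, `min_potency_gap`) are placeholders rather than the values used in this thesis.

```python
from itertools import combinations


def tanimoto(a, b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def find_activity_cliffs(compounds, min_similarity=0.55, min_potency_gap=2.0):
    """Return (id1, id2, Tc, potency gap) for every pair that forms a cliff.

    `compounds` is a list of (identifier, fingerprint, potency) tuples, with
    potency on a logarithmic scale such as pKi. Both thresholds are assumptions.
    """
    cliffs = []
    for (id1, fp1, p1), (id2, fp2, p2) in combinations(compounds, 2):
        tc = tanimoto(fp1, fp2)
        gap = abs(p1 - p2)
        if tc >= min_similarity and gap >= min_potency_gap:
            cliffs.append((id1, id2, round(tc, 2), round(gap, 2)))
    return cliffs


# Invented toy data: bit sets stand in for ECFP4-like fingerprints.
compounds = [
    ("cpd_A", {1, 2, 3, 4, 5, 6, 7, 8, 9}, 5.0),
    ("cpd_B", {1, 2, 3, 4, 5, 6, 7, 8, 10}, 8.5),
    ("cpd_C", {20, 21, 22}, 9.0),
]

print(find_activity_cliffs(compounds))
# [('cpd_A', 'cpd_B', 0.8, 3.5)]
```

The pairwise scan grows quadratically with the number of compounds, which is acceptable for target-specific data sets of moderate size but would need indexing or clustering for larger collections.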

(c) Estimation of Data Set Modelability

Having analyzed the impact of activity cliffs on the performance of QSAR models, Tropsha et al. introduced the concept of “data set modelability”, which provides a prior estimate of the feasibility of obtaining a predictive QSAR model from a given data set [132]. Estimation of the MODelability Index (MODI) facilitates the identification not only of a subset of the data set with high modelability but also of the set of descriptors most likely to result in highly predictive models [192]. MODI was originally defined as “an activity class-weighted ratio of the number of nearest-neighbor pairs of compounds with the same activity class versus the total number of pairs” [132]. The higher the MODI value, the higher the likelihood of obtaining a highly predictive QSAR model. In general, a MODI value of 0.6 was proposed as the threshold for a data set to qualify for a computational study [192]. However, different sets of descriptors might provide different MODI values for the same data set, as was also observed in this thesis.
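Read literally, the definition quoted above amounts to computing, for each activity class, the fraction of compounds whose nearest neighbor belongs to the same class, and then averaging these fractions over the classes. The sketch below follows that reading. It is an illustration rather than the implementation used in this thesis: the function name, the use of squared Euclidean distance in descriptor space, and the toy descriptor values are assumptions.

```python
def modi(features, labels):
    """Average over classes of the fraction of compounds whose nearest
    neighbor (excluding the compound itself) has the same class label."""
    n = len(features)
    same = {}    # per class: compounds whose nearest neighbor shares the class
    total = {}   # per class: number of compounds
    for i in range(n):
        nearest = min(
            (j for j in range(n) if j != i),
            key=lambda j: sum((a - b) ** 2
                              for a, b in zip(features[i], features[j])),
        )
        total[labels[i]] = total.get(labels[i], 0) + 1
        if labels[nearest] == labels[i]:
            same[labels[i]] = same.get(labels[i], 0) + 1
    return sum(same.get(c, 0) / total[c] for c in total) / len(total)


# Invented two-dimensional descriptors. One active compound sits next to
# two inactive ones, which lowers the index for both classes.
features = [
    (0, 0), (1, 0), (0, 1), (10, 10),       # actives
    (10, 11), (11, 10), (20, 20), (21, 20), # inactives
]
labels = ["active"] * 4 + ["inactive"] * 4

print(modi(features, labels))  # 0.625
```

In this toy set, three of four actives and two of four inactives have a nearest neighbor of their own class, giving (0.75 + 0.5) / 2 = 0.625. Recomputing the index with a different descriptor set changes the nearest neighbors and therefore the value, consistent with the descriptor dependence noted above.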

[Figure 2.2 panel labels. (a) MMP-cliff: CHEMBL1093044 (pKi = 5.2), CHEMBL1091775 (pKi = 9.1). (b) ECFP4-cliff: CHEMBL1822857 (pKi = 5.65), CHEMBL1822854 (pKi = 7.8), Tc = 0.82.]