2 Materials and Methods

2.5 Data Handling KNIME

KNIME (Konstanz Information Miner)²⁴² is a modular environment that enables visual assembly and execution of data pipelines. Nodes are assembled into workflows for data processing.

2.5.1 Filter Results of HTS and Hit Confirmation Assay

A KNIME²⁴² workflow was created to filter the results of the hit confirmation assay, PubChem BioAssay AID: 5883475, completed by the NIH Chemical Genomics Center (NCGC). The aim was to identify compounds from the set that were defined as ‘active’, had acceptable curve descriptions (discarding any with efficacy >150 calculated from a partial curve), and were available from preferred suppliers.

Of the 145 compounds screened, 82 were available from preferred suppliers, as identified using a workflow prepared by Dr Ben Wahab (Sussex Drug Discovery Centre, University of Sussex). This filter output an SDF file detailing the structures of the available compounds. Meta Node ExtractInfoForSubset was used to reintroduce information from the assay summary regarding the activity profile and curve descriptors. A row filter was used to select entries with PUBCHEM_ACTIVITY_OUTCOME ‘Active’, and Meta Node GoodCurveDescription was used to remove entries with unreasonable curve descriptors, where efficacy >150 was calculated from partial curves. Finally, Meta Node DiversityCluster was employed in an attempt to cluster the compounds into series of similar structures, and the data was output to an interactive table.

Figure 2.1. Overview of KNIME²⁴² workflow used to filter results of hit confirmation assay.

ExtractInfoForSubset

In Meta Node ExtractInfoForSubset, tautomers were generated for the compounds from both the original hit confirmation assay and the SDF file of those available for purchase. An Orig_number column was added to the former using RowID, to enable later identification of origin, and a constant value column was added to the latter to indicate availability. The two datasets were then concatenated, grouped by Unique_SMILES, filtered on availability from preferred suppliers, and grouped back by Orig_number to give individual entries for each compound input. Finally, these individual entries were recombined with the non-structural information from the assay summary by concatenating the datasets, grouping again by Orig_number and filtering on availability from preferred suppliers.
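The grouping logic described above can be illustrated outside KNIME. The following is only a minimal sketch in plain Python, not the Meta Node itself: the function name, the `Available` flag and the example rows are assumptions made for illustration, and it assumes that each structure has already been reduced to a single comparable Unique_SMILES string.

```python
from collections import defaultdict


def extract_info_for_subset(assay_rows, available_smiles):
    """Return the assay rows whose structure also appears in the supplier list.

    assay_rows: list of dicts holding 'Unique_SMILES' plus assay fields.
    available_smiles: iterable of Unique_SMILES strings for purchasable compounds.
    """
    # Tag each assay row with its position, mirroring the Orig_number column.
    tagged = [dict(row, Orig_number=i) for i, row in enumerate(assay_rows)]
    # Tag supplier rows with a constant availability flag (hypothetical name).
    supplier = [{"Unique_SMILES": s, "Available": True} for s in available_smiles]

    # Concatenate both tables and group the rows by structure.
    groups = defaultdict(list)
    for row in tagged + supplier:
        groups[row["Unique_SMILES"]].append(row)

    # Keep only groups containing a supplier row, then return to one
    # entry per Orig_number so each input compound appears once.
    kept = {}
    for rows in groups.values():
        if any(r.get("Available") for r in rows):
            for r in rows:
                if "Orig_number" in r:
                    kept[r["Orig_number"]] = r
    return [kept[k] for k in sorted(kept)]
```

Because the assay fields travel with each tagged row, the sketch does not need the second concatenation step that the workflow uses to reattach the non-structural information.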

Figure 2.2. Outline of KNIME²⁴² ExtractInfoForSubset Meta Node.

GoodCurveDescription

In Meta Node GoodCurveDescription, entries with Curve_Description ‘Partial curve; high efficacy’ were selected and filtered to retain only those with Efficacy lower than 150. These were then recombined with the entries with Curve_Description ‘Complete curve; high efficacy’, ‘Complete curve; partial efficacy’ or ‘Partial curve; partial efficacy’.
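The filter rule can be written as a short plain-Python sketch. This is an illustration under stated assumptions rather than the Meta Node: rows are assumed to be dictionaries keyed by the column names Curve_Description and Efficacy, and the function and constant names are hypothetical.

```python
# Curve descriptions that are accepted without any efficacy check.
ACCEPTED_AS_IS = {
    "Complete curve; high efficacy",
    "Complete curve; partial efficacy",
    "Partial curve; partial efficacy",
}
# Description that is accepted only below the efficacy limit.
PARTIAL_HIGH = "Partial curve; high efficacy"
EFFICACY_LIMIT = 150


def good_curve_description(rows):
    """Keep rows with an acceptable curve description."""
    kept = []
    for row in rows:
        description = row["Curve_Description"]
        if description in ACCEPTED_AS_IS:
            kept.append(row)
        elif description == PARTIAL_HIGH and row["Efficacy"] < EFFICACY_LIMIT:
            kept.append(row)
    return kept
```

Unlike the split-and-recombine approach in the workflow, this sketch keeps the rows in their input order; the set of retained entries should be the same.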

Figure 2.3. Outline of KNIME²⁴² GoodCurveDescription Meta Node.

DiversityCluster

In Meta Node DiversityCluster, the RDKit Diversity Picker was first used to select a diverse set from the data. A loop was initiated in which the fingerprints of the diverse molecules were compared to those of the complete dataset to calculate the Tanimoto similarity coefficient. The coefficients were sorted into descending order and entries with coefficients below 0.5 were removed. To collect only results which had been successfully clustered, the number of results of each iteration was counted and those iterations with two or more associated results were selected. A second loop was then used to recollect the compounds that were identified in those iterations.
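The clustering loop can be sketched in plain Python as follows. This is a simplified illustration, not the RDKit-based Meta Node: fingerprints are represented as sets of on-bit indices (the fingerprint type is not specified here), the diverse seed compounds are passed in rather than picked, and the function names are hypothetical.

```python
def tanimoto(a, b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def cluster_around_seeds(seeds, fingerprints, threshold=0.5, min_size=2):
    """Group compounds around each seed by Tanimoto similarity.

    seeds: names of the diverse compounds (assumed to be chosen beforehand).
    fingerprints: dict mapping compound name to a set of on-bit indices.
    """
    clusters = {}
    for seed in seeds:
        # Compare the seed with every compound in the complete dataset.
        scored = [
            (tanimoto(fingerprints[seed], fp), name)
            for name, fp in fingerprints.items()
        ]
        # Sort by descending coefficient and drop entries below the threshold.
        members = [
            name for score, name in sorted(scored, reverse=True)
            if score >= threshold
        ]
        # Keep only seeds with two or more associated results.
        if len(members) >= min_size:
            clusters[seed] = members
    return clusters
```

Since each seed is itself part of the complete dataset, it always matches itself with a coefficient of 1.0, so a cluster of two or more means the seed plus at least one similar compound.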

Figure 2.4. Outline of KNIME²⁴² DiversityCluster Meta Node.

2.5.2 Retrieve and React Isothiazolones

Compounds identified by a ChEMBL¹⁵⁹ search against KAT2B were filtered manually and only those with reasonable reported activity against the KAT2B HAT domain (< 150 μM) were retained. A KNIME²⁴² workflow was used to select and sort the isothiazolones from this list and then convert them to the mercaptoacrylamide species, as required for docking.

Of the 110 compounds input, 65 were selected as isothiazolones. These were separated using the Rule-based Row Splitter into those with and without reported IC50 values. For both subsets the String To Number functionality was used to convert the ChEMBL ID to a number, and duplicates were then removed using GroupBy ChEMBL ID. Following this, the Affinity was renamed either as IC50 or as % inhibition, before concatenating to afford one table. This table was then: saved as an SDF file, sorted by IC50 and saved as a PDF file for printing, and passed to Meta Node

Figure 2.5. Overview of KNIME²⁴² workflow used to retrieve and react isothiazolones.

ReactToMercaptoacrylamide

In Meta Node ReactToMercaptoacrylamide, the JChem Reactor node (JChem 6.1.0, 2013, ChemAxon, http://www.chemaxon.com) was used to convert the isothiazolones to the mercaptoacrylamide species, before converting and saving to SDF.

Figure 2.6. Outline of KNIME²⁴² ReactToMercaptoacrylamide Meta Node.

2.5.3 Generate Conformers

A KNIME²⁴² workflow was designed to generate six conformers of each compound from a library saved in SDF format, give each a unique identifier and output another SDF file.

The list of structures was imported from the SDF file. It was then necessary to convert the structures to a format recognised by the RDKit Add Conformers node. A loop was utilised to generate six conformers for each compound in turn and assign each a unique identifying number. Outside the loop this number was incorporated into the compound reference number, to generate a single unique identifier for each conformer, before saving to SDF.
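The identifier bookkeeping can be illustrated with a short plain-Python sketch. It does not generate any conformers; it only shows how a per-conformer number might be combined with the compound reference. The column names and the underscore-separated format are assumptions, not the format used in the workflow.

```python
def label_conformers(compound_refs, n_conformers=6):
    """Create one labelled row per conformer of each compound."""
    labelled = []
    # Inside the loop: one numbered entry per conformer of each compound.
    for ref in compound_refs:
        for conformer_no in range(1, n_conformers + 1):
            labelled.append({"Compound_Ref": ref, "Conformer_No": conformer_no})
    # Outside the loop: merge the number into the compound reference
    # to give a single identifier per conformer (hypothetical format).
    for row in labelled:
        row["Conformer_ID"] = f"{row['Compound_Ref']}_{row['Conformer_No']}"
    return labelled
```

For two hypothetical references `CPD001` and `CPD002`, this yields twelve rows with identifiers running from `CPD001_1` to `CPD002_6`.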

Figure 2.7. Outline of KNIME²⁴² Workflow Used to Generate Conformers.

2.5.4 Select Highest Scoring Docking Pose and Retrieve Top 40% of Compounds

A KNIME²⁴² workflow was designed to group conformers with multiple poses generated by docking, retaining only the highest scoring pose for each compound, sort these by an associated docking score, retrieve the top scoring 40% and output to an SDF file.

The list of docked structures was imported from an SDF file. The conformers were grouped, retaining the first (best) docking pose and associated score. The resulting data table was joined with that from a compound catalogue, to retrieve the associated catalogue numbers. The data was sorted by ascending docking score and a ranking column was inserted, dependent on the row index. The rows were filtered to retrieve those where RANK/0.4 < ROWCOUNT, i.e. the 40% with the best docking scores. The results were written to an SDF file.
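The grouping and rank filter can be sketched in plain Python as follows. This is an illustration under stated assumptions: the column names Compound_Ref and Docking_Score are hypothetical, the catalogue join is omitted, and RANK is assumed to start at 1 (a rank starting at 0 would retain a slightly different number of rows).

```python
def best_pose_per_compound(poses):
    """Keep the first pose listed for each compound.

    Assumes poses are listed best-first for each compound, as in the
    docking output described above.
    """
    best = {}
    for pose in poses:
        # setdefault stores only the first pose seen for each compound.
        best.setdefault(pose["Compound_Ref"], pose)
    return list(best.values())


def top_fraction(rows, fraction):
    """Return the rows satisfying RANK / fraction < ROWCOUNT."""
    # Ascending sort: the lowest docking score is treated as the best.
    ranked = sorted(rows, key=lambda r: r["Docking_Score"])
    row_count = len(ranked)
    return [
        row for rank, row in enumerate(ranked, start=1)
        if rank / fraction < row_count
    ]
```

As a worked example of the condition, with 7 compounds and a fraction of 0.4, ranks 1 and 2 pass (2/0.4 = 5 < 7) and rank 3 does not (3/0.4 = 7.5), so two compounds are retained.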

Figure 2.8. Outline of KNIME²⁴² Workflow Used to Select Highest Scoring Docked Pose and Retrieve Top 40% of Compounds by Docking Score.

2.5.5 Generate Final List of Docked Hit Compounds

A KNIME²⁴² workflow was designed to first split a list of docked compounds according to two columns, detailing whether they were interesting because they made new interactions with the protein, or should be ignored because the pose was unreasonable. Those compounds that made new interactions were then combined with the top 20% of compounds, by docking score, which did not make new interactions but whose poses were not unreasonable.

The list of docked structures was imported from an SDF file. The compounds were split depending on whether they formed new interactions. Those that did not form new interactions were split again to remove any marked to be ignored. Those not marked to be ignored were ranked by ascending docking score, dependent on the row index, and the rows were filtered to retrieve those where RANK/0.2 < ROWCOUNT, i.e. the 20% with the best docking scores. The top scoring 20% were combined with the compounds that formed new interactions to generate a final hit list, which was sorted by ascending docking score and saved to SDF.
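The split-and-combine logic can be sketched in plain Python. This is an illustration only: the column names New_Interaction, Ignore and Docking_Score are hypothetical stand-ins for the two annotation columns and the score, and RANK is assumed to start at 1.

```python
def final_hit_list(rows, fraction=0.2):
    """Combine compounds making new interactions with the best-scoring rest."""
    # First split: compounds that formed new interactions are all kept.
    new_interactions = [r for r in rows if r["New_Interaction"]]
    # Second split: of the remainder, discard any marked to be ignored.
    others = [
        r for r in rows
        if not r["New_Interaction"] and not r["Ignore"]
    ]

    # Rank the remainder by ascending docking score (lowest is best) and
    # keep rows satisfying RANK / fraction < ROWCOUNT.
    ranked = sorted(others, key=lambda r: r["Docking_Score"])
    row_count = len(ranked)
    top = [
        r for rank, r in enumerate(ranked, start=1)
        if rank / fraction < row_count
    ]

    # Combine both groups and sort the final list by docking score.
    return sorted(new_interactions + top, key=lambda r: r["Docking_Score"])
```

Note that in this sketch, as in the description above, the ignore flag is only applied to compounds that did not form new interactions.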

Figure 2.9. Outline of KNIME²⁴² Workflow Used to Split List of Compounds According to Two Columns and Partially

In document Design and characterisation of tool inhibitors of DNA damage response proteins (Page 67-72)