To confirm that our models perform lead hopping, we provide a visual analysis of each cluster. In this Section, we draw the active molecules that are closest to each cluster center. In Section 6.4.1, we describe a method for calculating the center of a cluster when the distance between examples is measured with the Tanimoto similarity. We then draw the 10 active molecules that are closest to each cluster center. Clusters (splits) from dataset 1 are analysed in Section 6.4.2 and clusters from dataset 2 in Section 6.4.3.
6.4.1 Calculating the Tanimoto Cluster Center
Using a basic clustering algorithm, such as K-means, the cluster center would be given by the average feature vector for the examples in a cluster. More formally, given a set of $N$ fingerprints, $F_1, F_2, \ldots, F_N$, the cluster center is calculated as:
\[
F_{\text{center}} = \frac{F_1 + F_2 + \cdots + F_N}{N} \tag{6.3}
\]
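As a small illustration of eq (6.3), the following sketch averages a handful of fingerprints component by component. The function name `mean_center` and the toy 0/1 fingerprints are assumptions made for this example only; they are not taken from the datasets used in this chapter.

```python
from typing import List, Sequence


def mean_center(fingerprints: Sequence[Sequence[float]]) -> List[float]:
    """Component-wise average of N equal-length fingerprints, as in eq (6.3)."""
    if not fingerprints:
        raise ValueError("need at least one fingerprint")
    length = len(fingerprints[0])
    if any(len(f) != length for f in fingerprints):
        raise ValueError("all fingerprints must have the same length")
    n = len(fingerprints)
    return [sum(f[j] for f in fingerprints) / n for j in range(length)]


# Hypothetical toy fingerprints (0/1 features), for illustration only.
fps = [[1, 0, 1, 1], [1, 1, 0, 1], [1, 0, 0, 1]]
center = mean_center(fps)  # features shared by all examples stay at 1.0
```

Note that the average of 0/1 fingerprints is in general a real-valued vector rather than a valid fingerprint of any molecule.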
However, with a complete link clustering algorithm, as described in the previous Section, a different similarity can be defined between examples, such as the Tanimoto similarity (eq (6.1)). As a result, the cluster center is no longer given by eq (6.3). Instead, we define the center of the cluster as the point $\hat{F}$ that maximizes the sum of the Tanimoto similarities between $\hat{F}$ and each fingerprint in the cluster:
\begin{align}
F_{\text{center}} &= \arg\max_{\hat{F}} \sum_{i=1}^{N} T\left(F_i, \hat{F}\right) \tag{6.4} \\
&= \arg\max_{\hat{F}} \sum_{i=1}^{N} \frac{\langle F_i, \hat{F} \rangle}{\langle F_i, F_i \rangle + \langle \hat{F}, \hat{F} \rangle - \langle F_i, \hat{F} \rangle} \tag{6.5}
\end{align}
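The objective in eq (6.5) can be evaluated directly for any candidate center. The sketch below is a minimal illustration, assuming fingerprints are stored as plain lists of numbers; the helper names `dot`, `tanimoto` and `objective`, the toy fingerprints, and the convention of returning 0 for two all-zero vectors are assumptions of this example.

```python
from typing import Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product <a, b>."""
    return sum(x * y for x, y in zip(a, b))


def tanimoto(a: Sequence[float], b: Sequence[float]) -> float:
    """Tanimoto similarity <a,b> / (<a,a> + <b,b> - <a,b>)."""
    ab = dot(a, b)
    denom = dot(a, a) + dot(b, b) - ab
    # Assumed convention: two all-zero vectors have similarity 0.
    return ab / denom if denom > 0 else 0.0


def objective(fingerprints: Sequence[Sequence[float]],
              candidate: Sequence[float]) -> float:
    """Sum of Tanimoto similarities between a candidate center and the cluster."""
    return sum(tanimoto(f, candidate) for f in fingerprints)


# Hypothetical toy fingerprints, for illustration only.
fps = [[1, 0, 1, 1], [1, 1, 0, 1], [1, 0, 0, 1]]
pair_similarity = tanimoto(fps[0], fps[1])  # 2 / (3 + 3 - 2) = 0.5
score = objective(fps, [1, 0, 0, 1])        # 2/3 + 2/3 + 1
```

Because the candidate may be real-valued, the same functions can score both actual fingerprints and intermediate points produced by an iterative search.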
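This can be solved by gradient ascent: we take the gradient of the objective in eq (6.5) and iteratively move to closer solutions:
\[
F_{t+1} = F_t + \epsilon \nabla_F \sum_{i=1}^{N} T(F_i, F_t) \tag{6.6}
\]
Setting an appropriate step size, $\epsilon$, and iterating until $\lvert F_{t+1} - F_t \rvert < \mathrm{tol}$, we can find the cluster center. A fingerprint is composed of components $F = (f_1, f_2, \ldots, f_n)$, so $\nabla_F$ is:
\[
\nabla_F = \left( \frac{\partial}{\partial f_1}, \frac{\partial}{\partial f_2}, \ldots, \frac{\partial}{\partial f_n} \right) \tag{6.7}
\]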
Chapter 6 Experiments in Lead Hopping 86
We use eq (6.3) as the initial guess, $F_0$. The gradient is given by:
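\begin{align}
\nabla_F \sum_{i=1}^{N} T(F_i, F_t) &= \sum_{i=1}^{N} \frac{F_i \left( \langle F_i, F_i \rangle + \langle F_t, F_t \rangle - \langle F_t, F_i \rangle \right) - \langle F_t, F_i \rangle \left( 2F_t - F_i \right)}{\left( \langle F_i, F_i \rangle + \langle F_t, F_t \rangle - \langle F_i, F_t \rangle \right)^2} \tag{6.8} \\
&= \sum_{i=1}^{N} \frac{F_i \langle F_i, F_i \rangle + F_i \langle F_t, F_t \rangle - 2F_t \langle F_i, F_t \rangle}{\left( \langle F_i, F_i \rangle + \langle F_t, F_t \rangle - \langle F_i, F_t \rangle \right)^2} \tag{6.9}
\end{align}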
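The iteration described in this Section can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the implementation used for the experiments: fingerprints are plain lists of numbers, the stopping test uses the Euclidean norm of the change, and the names `tanimoto_gradient`, `tanimoto_center` and `max_iter` as well as the toy fingerprints are hypothetical. The gradient of each term uses the Tanimoto denominator $\langle F_i, F_i \rangle + \langle F_t, F_t \rangle - \langle F_i, F_t \rangle$, squared.

```python
from math import sqrt
from typing import List, Sequence, Tuple

Vector = List[float]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product <a, b>."""
    return sum(x * y for x, y in zip(a, b))


def tanimoto(a: Sequence[float], b: Sequence[float]) -> float:
    """Tanimoto similarity <a,b> / (<a,a> + <b,b> - <a,b>)."""
    ab = dot(a, b)
    denom = dot(a, a) + dot(b, b) - ab
    return ab / denom if denom > 0 else 0.0


def objective(fingerprints: Sequence[Sequence[float]],
              candidate: Sequence[float]) -> float:
    """Sum of similarities between the candidate and every fingerprint."""
    return sum(tanimoto(f, candidate) for f in fingerprints)


def tanimoto_gradient(fi: Sequence[float], ft: Sequence[float]) -> Vector:
    """Gradient of T(fi, ft) with respect to ft (one term of the sum).

    Assumes fi and ft are not both all-zero, so the denominator is positive.
    """
    ii, tt, it = dot(fi, fi), dot(ft, ft), dot(fi, ft)
    denom = (ii + tt - it) ** 2
    return [(a * (ii + tt) - 2.0 * b * it) / denom for a, b in zip(fi, ft)]


def tanimoto_center(fingerprints: Sequence[Sequence[float]],
                    step: float = 0.01,
                    tol: float = 0.001,
                    max_iter: int = 10_000) -> Tuple[Vector, int]:
    """Gradient ascent on the summed similarity, started from the mean."""
    n = len(fingerprints)
    dim = len(fingerprints[0])
    current = [sum(f[j] for f in fingerprints) / n for j in range(dim)]
    for iteration in range(1, max_iter + 1):
        grad = [0.0] * dim
        for f in fingerprints:
            for j, g in enumerate(tanimoto_gradient(f, current)):
                grad[j] += g
        updated = [c + step * g for c, g in zip(current, grad)]
        # Euclidean norm of the update (an assumption about the stopping test).
        change = sqrt(sum((u - c) ** 2 for u, c in zip(updated, current)))
        current = updated
        if change < tol:
            return current, iteration
    return current, max_iter


# Hypothetical toy fingerprints, for illustration only.
fps = [[1, 0, 1, 1], [1, 1, 0, 1], [1, 0, 0, 1]]
start = [sum(col) / len(fps) for col in zip(*fps)]
center, iterations = tanimoto_center(fps)
```

The `max_iter` cap is a safeguard added for this sketch, so that a poorly chosen step size cannot loop forever. Comparing `objective(fps, center)` with `objective(fps, start)` gives a quick check that the iteration has improved on the mean.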
6.4.2 Cluster Centers of Dataset 1 - NCI-HIV Data
In this Section, we describe the 10 active molecules closest to the cluster center of each split in dataset 1. The data was originally separated into two clusters using fingerprints of the molecular graph and the Tanimoto similarity (described in Section 6.2). We calculate the center of each cluster using the iterative approach described in Section 6.4.1. The centers of both clusters in dataset 1 could be found in around 50 iterations by setting $\epsilon = 0.01$ and tol = 0.001.
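Selecting the molecules closest to a cluster center can be sketched as a ranking by similarity. The example below assumes that "closest" means highest Tanimoto similarity to the center and that active molecules are stored in a dictionary from name to fingerprint; the function `closest_to_center`, the molecule names and the fingerprints are all hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product <a, b>."""
    return sum(x * y for x, y in zip(a, b))


def tanimoto(a: Sequence[float], b: Sequence[float]) -> float:
    """Tanimoto similarity <a,b> / (<a,a> + <b,b> - <a,b>)."""
    ab = dot(a, b)
    denom = dot(a, a) + dot(b, b) - ab
    return ab / denom if denom > 0 else 0.0


def closest_to_center(actives: Dict[str, Sequence[float]],
                      center: Sequence[float],
                      k: int = 10) -> List[Tuple[str, float]]:
    """Return up to k (name, similarity) pairs, most similar to the center first."""
    scored = [(name, tanimoto(fp, center)) for name, fp in actives.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical molecule names and fingerprints, for illustration only.
actives = {
    "mol_a": [1, 0, 0, 1],
    "mol_b": [1, 1, 0, 1],
    "mol_c": [0, 1, 1, 0],
}
center = [1.0, 0.3, 0.3, 1.0]
top_two = closest_to_center(actives, center, k=2)
```

In this toy case `mol_a` ranks first and `mol_b` second, since they overlap most with the high-valued components of the center.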
The closest 10 active molecules in split 1 to the cluster center are given in Figure 6.2. A common pattern among these actives appears to be the presence of two aromatic rings separated by a group of one sulphur and two oxygen atoms (seen in molecules 633001, 624231, 629267, 633011 and 667950). In addition, a group of atoms (one nitrogen double bonded to two oxygen atoms) is attached as a side group to aromatic rings in molecules 629273, 633001, 624231, 629267 and 667950. Molecules 7229, 638478 and 641295 do not share these properties. However, as they are large molecules with many rings, they are likely to share many fingerprint features with the other molecules, which raises their Tanimoto similarity. Overall, all of these molecules are similar according to the clustering algorithm and molecular fingerprints.
Figure 6.2: The figure gives the 10 closest molecules to the cluster center of split 1, dataset 1 (NCI-HIV data). Molecule names are given above each molecular graph. They are sorted from the closest in the top-left (molecule 629273) to the 10th closest in the bottom-right (molecule 667950).
The closest 10 active molecules in split 2 to the cluster center are given in Figure 6.3. There appear to be two groups of similar molecules. The first group includes molecules 647648, 637646 and 646444, which share a similar central structure of two rings and differ only by a few groups of atoms attached at the side. The second group includes molecules 50848 and 50850. Again, these are almost identical except for a few atoms attached at one end of the molecule. The remaining five (624151, 695836, 128701, 684881 and 318534) do not appear to be similar visually, although they are grouped together in the same cluster according to the Tanimoto similarity of their molecular fingerprints.
Figure 6.3: The figure gives the 10 closest molecules to the cluster center of split 2, dataset 1 (NCI-HIV data). Molecule names are given above each molecular graph. They are sorted from the closest in the top-left (molecule 624151) to the 10th closest in the bottom-right (molecule 50850).
6.4.3 Cluster Centers of Dataset 2 - Pyruvate Kinase Data
In this Section, we show the 10 active molecules closest to the cluster center of each split in dataset 2. Using the iterative approach described in Section 6.4.1, the centers of both clusters in dataset 2 could be found in around 50 iterations by setting $\epsilon = 0.01$ and tol = 0.001.
The closest 10 active molecules in split 1 to the cluster center are given in Figure 6.4. Two pairs of molecules (5356192 with 5771437, and 445154 with 3238641) appear to match visually, although the remaining molecules do not. Despite this, all 10 are grouped together in the same cluster according to the Tanimoto similarity of their molecular fingerprints.
Figure 6.4: The figure gives the 10 closest molecules to the cluster center of split 1, dataset 2 (Pyruvate Kinase data). Molecule names are given above each molecular graph. They are sorted from the closest in the top-left (molecule 660143) to the 10th closest in the bottom-right (molecule 315109).
The closest 10 active molecules in split 2 to the cluster center are given in Figure 6.5. Apart from one pair (2195987 and 2193330), the molecules do not appear to match visually. Despite this, they are grouped together in the same cluster according to the Tanimoto similarity of their molecular fingerprints.
Figure 6.5: The figure gives the 10 closest molecules to the cluster center of split 2, dataset 2 (Pyruvate Kinase data). Molecule names are given above each molecular graph. They are sorted from the closest in the top-left (molecule 694854) to the 10th closest in the bottom-right (molecule 660799).