ITERATIVE SCREENING
4.3 DATA MINING TECHNIQUES
4.3.1 Clustering and Partitioning
Clustering algorithms have been, and continue to be, widely used for compound classification [8–10] and for both diversity- and activity-oriented compound selection. Partitioning algorithms [12] are applied for the same purposes but have a shorter history in chemoinformatics than clustering methods.
Clustering and partitioning methods are often not clearly distinguished in the literature, although they do have a principal difference that is relevant for compound classification and selection: regardless of their algorithmic details, clustering methods involve at some stage pairwise distance or similarity comparisons, whereas partitioning algorithms do not; rather, they generally create coordinate systems in chemical reference spaces into which test compounds fall based on their calculated feature values. As a consequence, partitioning methods can be applied to much larger data sets than conventional clustering techniques, which has become particularly relevant during the age of combinatorial library generation. Both clustering and partitioning methods represent a form of unsupervised learning and thus do not require training sets of known active compounds [43]. Instead, they organize a chemical library into subsets of compounds that are similar according to a chosen metric, given a chemical reference space. Clustering and partitioning are often applied to cover available chemical space by selecting representative compounds from all clusters or partitions. Accordingly, these methods have also been adapted for sequential screening where representative compound subsets are initially selected from a library and are tested to identify novel hits. Then, during iterative rounds, newly identified hits are added to the classification process to select similar compounds from the library for further evaluation [13,44]. Thus, sequential screening integrates diversity- and activity-oriented compound selection schemes.
As mentioned above, clustering depends on the calculation of intermolecular distances in chemical reference spaces, whereas partitioning is based on establishing a consistent reference frame that allows the independent assignment of coordinates to each database molecule. With the increasing size of data sets, clustering algorithms can become computationally expensive, if not prohibitive. This is especially the case for hierarchical clustering methods where all intermolecular distances need to be considered. For hierarchical agglomerative clustering methods, clusters are obtained by iteratively combining smaller clusters to form larger ones, beginning with singletons (i.e., single-compound clusters). By contrast, hierarchical divisive methods start with a single large cluster consisting of all compounds and iteratively split clusters into smaller ones [45]. Besides distinguishing between top-down and bottom-up approaches, hierarchical clustering methods differ in the way in which intercluster distances are measured. Popular methods consider the minimum, maximum, or average distance between the compounds of two clusters.
For example, Ward's clustering algorithm minimizes intracluster variance and maximizes intercluster variance and thus attempts to minimize the increase in information loss when joining clusters [46].
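The agglomerative scheme and the three intercluster distance definitions described above can be sketched in a few lines. This is a minimal illustration, not an implementation taken from the cited references: the function names, the linkage labels ("single", "complete", "average" for minimum, maximum, and average distance), and the toy one-dimensional data are assumptions made for the example.

```python
from itertools import combinations


def cluster_distance(a, b, dist, linkage="average"):
    """Distance between clusters a and b, given as lists of compound indices."""
    pair_dists = [dist[i][j] for i in a for j in b]
    if linkage == "single":      # minimum pairwise distance
        return min(pair_dists)
    if linkage == "complete":    # maximum pairwise distance
        return max(pair_dists)
    if linkage == "average":     # mean pairwise distance
        return sum(pair_dists) / len(pair_dists)
    raise ValueError(f"unknown linkage: {linkage}")


def agglomerate(dist, n_clusters, linkage="average"):
    """Merge the two closest clusters until n_clusters remain."""
    clusters = [[i] for i in range(len(dist))]  # start from singletons
    while len(clusters) > max(n_clusters, 1):
        # Every pair of clusters is examined in every round.
        x, y = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: cluster_distance(
                clusters[p[0]], clusters[p[1]], dist, linkage
            ),
        )
        clusters[x] = clusters[x] + clusters[y]  # x < y, so index x stays valid
        del clusters[y]
    return clusters


# Toy data (hypothetical): four compounds described by one descriptor value.
points = [0.0, 1.0, 5.0, 6.0]
dist = [[abs(p - q) for q in points] for p in points]
result = agglomerate(dist, 2, linkage="single")
print(result)
```

The repeated scan over all cluster pairs, each of which touches all pairwise compound distances, makes visible why hierarchical clustering becomes expensive for large data sets.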
Nonhierarchical methods are generally faster but require either that the total number of clusters be preset, as in k-means clustering [47], or that a neighborhood be defined, as in Jarvis–Patrick clustering [48]. Cell-based partitioning methods [12,49] and variants like median partitioning [50] are an attractive alternative because of their computational efficiency. These methods assign molecules to cells defined by a combination of descriptor ranges. A prominent and widely applied supervised learning variant for classification problems is the recursive partitioning approach [51,52]. Recursive partitioning divides compound sets along decision trees and attempts to generate homogeneous subsets at the leaves, thereby separating molecules according to activity.
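To illustrate how cell-based partitioning assigns molecules to cells defined by descriptor ranges, the following sketch divides each descriptor axis into equal-width bins. The equal-width binning, the function names, and the descriptor values are assumptions chosen for illustration; the cited methods may define cell boundaries differently (median partitioning, for instance, splits at descriptor medians).

```python
from collections import defaultdict


def cell_index(value, low, high, n_bins):
    """Map one descriptor value to a bin index in [0, n_bins - 1]."""
    if high == low:
        return 0
    frac = (value - low) / (high - low)
    # Clamp so that values on or outside the range fall into the edge bins.
    return min(max(int(frac * n_bins), 0), n_bins - 1)


def partition(descriptors, ranges, n_bins):
    """Group molecules by the cell (tuple of bin indices) they fall into.

    descriptors: dict mapping molecule name -> tuple of descriptor values
    ranges:      list of (low, high) pairs, one per descriptor
    """
    cells = defaultdict(list)
    for name, values in descriptors.items():
        key = tuple(
            cell_index(v, low, high, n_bins)
            for v, (low, high) in zip(values, ranges)
        )
        cells[key].append(name)
    return dict(cells)


# Hypothetical molecules with two descriptors scaled to [0, 1].
descriptors = {"a": (0.1, 0.9), "b": (0.2, 0.8), "c": (0.9, 0.1)}
cells = partition(descriptors, ranges=[(0.0, 1.0), (0.0, 1.0)], n_bins=2)
print(cells)
```

Each molecule is assigned from its own descriptor values alone, with no pairwise comparison, so the cost grows linearly with the number of molecules.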
4.3.2 Similarity Searching
Like compound clustering, similarity searching is among the most widely employed approaches in chemoinformatics. The notion of compound similarity and the search for similar molecules are at the core of ligand-based virtual screening concepts. Since the explicit formulation of the similarity property principle, which simply states that similar molecules should have similar biological activities [53], a wide variety of concepts of what constitutes molecular similarity have been developed, and a multitude of computational methods for identifying similar molecules in compound databases have been devised.
In its most basic form, similarity searching detects common 2-D substructures in the chemical graphs of molecules [25]. As mentioned above, these graph-based approaches are computationally quite expensive, and the need for more efficient alternatives has boosted the popularity of fingerprints to a large extent. Another reason for the popularity of fingerprints is that they can be used to generate search queries if only single bioactive compounds are available as templates, in contrast to other compound classification approaches including machine learning methods.
As discussed above, fingerprints abstract from the chemical structure and make searching of large databases feasible. Importantly, they decouple similarity assessment from direct structural comparisons through the evaluation of bit string similarity. In general, fingerprint-based similarity evaluation depends on two criteria: the type of fingerprint that is used and the chosen similarity measure. Fingerprints can often be easily modified. For example, for structural fingerprints, Durant et al. [32] systematically investigated subsets of the MDL keys for their ability to detect molecules having similar activity in order to optimize sets of structural keys for similarity searching.
In addition to differences in fingerprint design, there is also a variety of similarity measures available for fingerprint comparison [54] including, for example, the Hamming and Euclidean distances. For binary vectors, the Hamming distance simply counts the number of bit differences between two fingerprints, and the Euclidean distance is the square root of the Hamming distance. Most popular in chemical similarity searching is the Tanimoto or Jaccard coefficient, which is the ratio of the number of bits set on in both fingerprints to the number of bits set on in either fingerprint.
The Hamming and Euclidean distances account equally for the presence and absence of features, i.e., binary complement fingerprints yield the same distance, whereas the Tanimoto coefficient only takes into account bit positions that are set on. For instance, consider two fingerprints in each of which 75% of all bits are set on and whose shared on-bits make up 50% of all bit positions. Together the two fingerprints then cover every bit position, and a Tanimoto coefficient of 0.5 is obtained. However, if we take the complement instead, i.e., count the absence of features instead of their presence, a Tanimoto coefficient of 0 is obtained because there is no overlap in missing features (i.e., no bit position is set to zero in both fingerprints).
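The three measures and the complement example can be checked with a small sketch. Fingerprints are represented here as Python sets of on-bit positions, and the 8-bit fingerprints are hypothetical values chosen to match the 75%/50% example; returning 0.0 for two empty fingerprints is an assumed convention, since the coefficient is undefined in that case.

```python
import math


def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return len(a ^ b)


def euclidean(a, b):
    """For binary vectors, the square root of the Hamming distance."""
    return math.sqrt(hamming(a, b))


def tanimoto(a, b):
    """Bits set on in both fingerprints divided by bits set on in either."""
    union = len(a | b)
    # Assumed convention: two empty fingerprints get a coefficient of 0.0.
    return len(a & b) / union if union else 0.0


def complement(a, n_bits):
    """Fingerprint with every bit flipped."""
    return set(range(n_bits)) - a


n_bits = 8
fp1 = {0, 1, 2, 3, 4, 5}   # 6 of 8 bits on (75%)
fp2 = {2, 3, 4, 5, 6, 7}   # 6 of 8 bits on (75%); 4 shared bits (50% of all)
c1, c2 = complement(fp1, n_bits), complement(fp2, n_bits)

print(tanimoto(fp1, fp2), tanimoto(c1, c2))   # presence vs. absence of features
print(hamming(fp1, fp2), hamming(c1, c2))     # unchanged under complement
```

The Hamming distance is 4 for both the original and the complemented pair, while the Tanimoto coefficient drops from 0.5 to 0.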
Similarity measures enable the ranking of database compounds based on similarity to single or multiple reference compounds and, in successful applications, achieve an enrichment of novel active molecules among the top-ranked compounds. However, the most similar compounds are typically analogues of active reference molecules, and one can therefore not expect to identify diverse structures having similar activity by simply selecting top-ranked database compounds. For the identification of different active chemotypes, similarity value ranges where scaffold hopping occurs must be individually determined for combinations of fingerprints and similarity coefficients, which represents a nontrivial task.
In part due to the availability of large databases consisting of different classes of bioactive compounds like the MDDR (a database compiled from patent literature) [55] or WOMBAT [56], similarity searching using multiple reference molecules has become increasingly popular and is typically found to produce higher recall of active molecules compared with calculations using single templates. These findings are intuitive because the availability of multiple compounds increases the chemical information content of the search calculations. For fingerprint searching using multiple reference molecules, different search strategies have been developed that combine compound information either at the level of similarity evaluation or at the level of fingerprint generation. One principal approach is data fusion, which merges the results from different similarity searches [57–60] either by fusing the search results based on the rank of each compound or by using the compound score.
This can be achieved, for example, by considering only the highest rank of a database compound relative to each individual template, by calculating the sum of ranks, or by averaging the similarity values of the nearest neighbors.
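The three fusion rules just listed can be sketched as follows. This is an illustrative reading of the text rather than the procedure of any particular cited study: the function names, the tie-breaking by compound name, the parameter k, and the similarity values are all assumptions.

```python
def ranks(scores):
    """Rank compounds 1..n by descending similarity (ties broken by name)."""
    ordered = sorted(scores, key=lambda c: (-scores[c], c))
    return {c: r for r, c in enumerate(ordered, start=1)}


def fuse_best_rank(per_template):
    """Keep the highest (numerically smallest) rank over all templates."""
    tables = [ranks(s) for s in per_template]
    return {c: min(t[c] for t in tables) for c in tables[0]}


def fuse_rank_sum(per_template):
    """Sum the ranks a compound obtains relative to each template."""
    tables = [ranks(s) for s in per_template]
    return {c: sum(t[c] for t in tables) for c in tables[0]}


def fuse_knn_mean(per_template, k):
    """Average the similarity values to the k most similar templates."""
    fused = {}
    for c in per_template[0]:
        top = sorted((s[c] for s in per_template), reverse=True)[:k]
        fused[c] = sum(top) / len(top)
    return fused


# Hypothetical similarity values of three database compounds (m1..m3)
# to two reference molecules; one dict per template.
per_template = [
    {"m1": 0.9, "m2": 0.4, "m3": 0.2},
    {"m1": 0.3, "m2": 0.5, "m3": 0.8},
]
best = fuse_best_rank(per_template)
total = fuse_rank_sum(per_template)
nearest = fuse_knn_mean(per_template, k=1)
print(best, total, nearest)
```

In this toy case, rank summation cannot separate the three compounds, whereas the best-rank and nearest-neighbor rules favor m1 and m3, each of which closely resembles one of the templates.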
At the level of fingerprints, information from multiple reference molecules can be taken into account by calculation of consensus fingerprints [61], scaling of most frequently occurring bit positions [62], or by determining and using feature value ranges that are most characteristic of template sets [63,64]. Similarity searching is clearly not limited to the use of fingerprint descriptors.
As stated above, reduced graph representations or pharmacophore models are also employed.
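Returning to the fingerprint-level strategies, one plausible way to build a consensus fingerprint from several reference molecules is to keep each bit that is set on in at least a given fraction of them. The sketch below is an assumption-laden illustration and does not reproduce the specific method of the cited work; the function name, the `min_fraction` parameter, and the example fingerprints are hypothetical.

```python
from collections import Counter


def consensus_fingerprint(fingerprints, min_fraction=0.5):
    """Bits set on in at least min_fraction of the reference fingerprints.

    fingerprints: list of sets of on-bit positions
    """
    counts = Counter(bit for fp in fingerprints for bit in fp)
    needed = min_fraction * len(fingerprints)
    return {bit for bit, n in counts.items() if n >= needed}


# Hypothetical reference fingerprints given as sets of on-bit positions.
refs = [{1, 2, 3}, {2, 3, 4}, {3, 4, 5}]
majority = consensus_fingerprint(refs, min_fraction=0.5)
strict = consensus_fingerprint(refs, min_fraction=1.0)
print(sorted(majority), sorted(strict))
```

The resulting consensus fingerprint can then be used as a single query with any of the similarity measures discussed above.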
Having discussed clustering and similarity search techniques that have a long history in chemical database analysis, in the following, we will focus on three data mining approaches that are based on statistics and machine learning. Because these data mining approaches have in recent years become increasingly popular in chemoinformatics, we discuss their theoretical foundations in some detail.