2. Automated workflow for construction of structural alert-based structure-activity

2.1. Creation of New Balanced Data Sets

2.2.1. Bayesian Statistics

In Bayesian statistics, probability is a description of how certain you are that something is true.87,88 If you are very sure of something, new data is unlikely to change your mind. Bayesian statistics is a broad subject which has been applied in many different ways for many purposes. Madigan et al. have previously reviewed some applications of Bayesian statistics in pharmacology.89

In this chapter, SAR models were constructed by iteratively selecting common substructures occurring in training chemicals to be coded as structural alerts. At each iteration, one could observe how many active and inactive chemicals contain different substructures. How could this be used to systematically pick the best performing structural alert?

Simply using the ratio of occurrence in actives to occurrence in inactives, as used in SARpy,74 would result in near-exclusive selection of substructures which occur in no inactives. For example, a substructure which occurred in two actives and zero inactives would be picked as a structural alert ahead of a substructure which occurred in 200 actives and one inactive.
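The ranking problem described above can be illustrated with a minimal Python sketch. The function name `occurrence_ratio` and the candidate labels are hypothetical, and treating a substructure with zero inactives as having an infinite ratio is an assumption made for illustration; it is not taken from the SARpy implementation.

```python
import math


def occurrence_ratio(actives: int, inactives: int) -> float:
    """Ratio of active to inactive occurrences of a substructure.

    Assumption: a substructure found in no inactives is given an
    infinite ratio, so it always outranks any finite ratio.
    """
    if inactives == 0:
        return math.inf
    return actives / inactives


# (actives, inactives) counts for the two substructures in the example above.
candidates = {"rare": (2, 0), "common": (200, 1)}

# Rank substructures from highest to lowest ratio.
ranked = sorted(
    candidates,
    key=lambda name: occurrence_ratio(*candidates[name]),
    reverse=True,
)
print(ranked)  # ['rare', 'common']
```

Under this scoring, the substructure seen in only two chemicals is preferred to the one seen in 201, which is the behaviour the text identifies as undesirable.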

As discussed in Section 1.7, previous statistical approaches have used a significance level test with a single binomial distribution to identify activating or non-activating substructures. If the probability of randomly producing the observed distribution of active and inactive chemicals is sufficiently low, the substructure is identified as either activating or non-activating. This approach does not allow for any adjustment in the relative weighting of active and inactive chemicals.

An alternative approach is, for a given distribution of active and inactive chemicals containing a substructure, to compare the probabilities of that distribution under two models: one biased towards active chemicals and the other random. The greater the probability given by the biased model relative to the random model, the more activating the substructure is. Changing how biased the biased model is changes the relative weight of active and inactive chemicals: the greater the bias, the more active chemicals are required per inactive chemical.

This comparison of probabilities from competing models can be done by calculating Bayes Factor in Bayesian statistics. Using Bayesian statistics for this model comparison allows for more flexibility in how the problem is approached. Bayesian statistics does not require the bias of the biased model to be fixed, but allows it to be treated as an unknown parameter to be fitted to a distribution. However, if the biased model is fixed, the equation derived from Bayesian statistics is

Neyman-Pearson lemma is that the fixed model and the prior likelihood of each model occurring have been explicitly stated, and these can be changed later. In this work, only the effect of using different fixed biases will be explored. The model comparison will be set up as a Bayesian equation so that the bias can easily be fitted to a distribution in future.

2.2.1.1. Bayes Theorem

Bayes Theorem, when considering the appropriateness of a model, $M_i$, for given data, $D$, is written as

$$p(M_i \mid D) = \frac{p(D \mid M_i)\, p(M_i)}{p(D)}$$

Where:

$p(M_i \mid D)$ is the probability of model $i$ occurring given the data

$p(D \mid M_i)$ is the probability of the data occurring for model $i$

$p(M_i)$ is the probability of model $i$ occurring

$p(D)$ is the probability of the data occurring

When a second possible model, $M_j$, is considered, this becomes:

$$\frac{p(M_i \mid D)}{p(M_j \mid D)} = \frac{p(D \mid M_i)}{p(D \mid M_j)} \times \frac{p(M_i)}{p(M_j)}$$

Where Bayes Factor is defined as:

$$\text{Bayes Factor} = \frac{p(D \mid M_i)}{p(D \mid M_j)}$$

$p(D \mid M_i)$ and $p(D \mid M_j)$ can be calculated, and so Bayes Factor is a value that can be calculated.

Assuming that $p(M_i) = p(M_j)$, i.e. that before any data has been considered, both models are equally likely to occur:

$$\frac{p(M_i \mid D)}{p(M_j \mid D)} = \text{Bayes Factor}$$

Hence, the likelihood of two models occurring for given data can be compared by calculating Bayes Factor. When both models are equally likely before any data is considered, Bayes Factor indicates how many times more likely one model, $M_i$, is than another model, $M_j$, given the data.
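The relationship between the posterior odds, Bayes Factor and the prior probabilities of the two models can be sketched in a few lines of Python. The function name `posterior_odds` and its parameters are hypothetical and chosen only for illustration; the default priors of 0.5 reflect the equal-prior assumption made above.

```python
def posterior_odds(bayes_factor: float,
                   prior_i: float = 0.5,
                   prior_j: float = 0.5) -> float:
    """Return p(M_i|D) / p(M_j|D) = Bayes Factor * p(M_i) / p(M_j)."""
    if prior_i <= 0 or prior_j <= 0:
        raise ValueError("prior probabilities must be positive")
    return bayes_factor * (prior_i / prior_j)


# With equal priors the posterior odds equal Bayes Factor.
print(posterior_odds(4.0))  # 4.0

# If M_j is considered four times more likely than M_i beforehand,
# a Bayes Factor of 4 leaves the two models equally likely.
print(posterior_odds(4.0, prior_i=0.2, prior_j=0.8))  # 1.0
```

Keeping the priors as explicit arguments mirrors the point that they are stated openly and can be changed later.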

2.2.1.2. Using Bayes Factor to pick structural alerts

For each substructure, the given data (D) is the number of actives containing the substructure and number of inactives containing the substructure.

Bayes Factor is calculated by comparing two binomial models, defined as:

• $M_{\mathrm{bias}}$ – a model which is biased towards active predictions. $M_{\mathrm{bias}}$ predicts active with a probability of θ and inactive with a probability of (1 – θ), where 0.5 < θ < 1.

• $M_{\mathrm{random}}$ – a model which predicts activity randomly. It predicts active with a probability of 0.5 and inactive with a probability of 0.5.

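$$\text{Bayes Factor} = \frac{p(D \mid M_{\mathrm{bias}})}{p(D \mid M_{\mathrm{random}})}$$

$$\text{Bayes Factor} = \frac{\theta^{\mathit{actives}}\,(1 - \theta)^{\mathit{inactives}}}{0.5^{\mathit{actives}}\; 0.5^{\mathit{inactives}}}$$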

Where "𝑎𝑐𝑡𝑖𝑣𝑒𝑠" is the number of actives containing the substructure, and "𝑖𝑛𝑎𝑐𝑡𝑖𝑣𝑒𝑠" is the number of inactives containing the substructure.

The value for theta (θ) must be selected by the user and will result in different priorities when selecting from a list of substructures. This is explored further in later sections.

Taking the logarithm of the previous equation made it easier to handle large values of 𝑎𝑐𝑡𝑖𝑣𝑒𝑠 and 𝑖𝑛𝑎𝑐𝑡𝑖𝑣𝑒𝑠:

$$\log(\text{Bayes Factor}) = \mathit{actives} \times \log\theta + \mathit{inactives} \times \log(1 - \theta) - (\mathit{actives} + \mathit{inactives}) \log 0.5$$

Simply put, Bayes Factor can be viewed as a scoring system for each substructure. The more actives containing a substructure, the higher the value of Bayes Factor. The more inactives containing a substructure, the lower the value of Bayes Factor. Adjusting the value of θ changes the relative scoring of active and inactive chemicals – a greater value of θ will result in greater increases in Bayes Factor from active chemicals and greater decreases in Bayes Factor from inactive chemicals.
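The scoring described above can be sketched in Python as follows. This is an illustrative sketch rather than the code used in this work: the function names `log_bayes_factor` and `actives_per_inactive`, the candidate labels and the θ values tried are all hypothetical, and the natural logarithm is an assumption (the base only rescales the scores and does not change the ranking).

```python
import math


def log_bayes_factor(actives: int, inactives: int, theta: float) -> float:
    """Log Bayes Factor of the biased model against the random model."""
    if not 0.5 < theta < 1.0:
        raise ValueError("theta must satisfy 0.5 < theta < 1")
    return (actives * math.log(theta)
            + inactives * math.log(1.0 - theta)
            - (actives + inactives) * math.log(0.5))


def actives_per_inactive(theta: float) -> float:
    """Number of actives whose gain cancels the penalty of one inactive.

    Each active adds log(theta / 0.5) to the score and each inactive
    adds the negative amount log((1 - theta) / 0.5).
    """
    return -math.log((1.0 - theta) / 0.5) / math.log(theta / 0.5)


# (actives, inactives) counts: two actives and no inactives,
# versus 200 actives and one inactive.
candidates = {"rare": (2, 0), "common": (200, 1)}

for theta in (0.7, 0.9, 0.99):
    ranked = sorted(
        candidates,
        key=lambda name: log_bayes_factor(*candidates[name], theta),
        reverse=True,
    )
    print(theta, ranked, round(actives_per_inactive(theta), 2))
```

For each of the θ values tried here, the substructure found in 200 actives and one inactive scores above the one found in two actives and no inactives, unlike a simple active-to-inactive ratio. The second function makes the role of θ concrete: the break-even number of actives per inactive grows as θ increases.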

In document Structure-based Predictions for Molecular Initiating Events (Page 52-55)