Development of Linear QSPRs - Development of Computer-Aided Molecular Design Methods for Bioeng

Development of linear QSPRs comprises four steps:

1. Identification of key properties and procurement of experimental property values
2. Calculation of descriptors and descriptor selection
3. Linear regression
4. Cross-validation

The steps are further detailed in the following sections.

4.2.1 Glass Transition Properties

QSPRs were developed for four carbohydrate properties: the glass transition temperature of the anhydrous solute, the glass transition temperature of the maximally concentrated solute, the melting point of ice and the Gordon-Taylor constant. The experimental data were collected from published literature (Roos 1993). Glass transitions and their importance in lyophilized protein formulations are discussed in detail in Section 2.5.

4.2.2 Percent Monomer Remaining Following Lyophilization

The quantitative measure of aggregation used for property modeling was percent monomer remaining following lyophilization. Measurement and calculation of percent monomer is detailed in Section 3.3.

Linear QSPRs for percent monomer were developed either as a function of protein structure (formulation-by-formulation basis) or as a function of excipient structure (protein-by-protein basis).

Additionally, a non-linear QSPR for percent monomer was developed as a function of both protein and excipient structure (see Section 4.3). Protein-based descriptors were used to represent protein structure and chiral-corrected connectivity indices were used to represent excipient structure.

4.2.3 Properties for in situ NDHD recovery

The example case used for in situ product recovery during fermentation is the production of (1R,2S)-1,2-naphthalene dihydrodiol (NDHD) by Escherichia coli. NDHD is an important intermediate product that can be used in synthesis of pharmaceutical intermediates or in synthesis of polymers (Raschke, Meier et al. 2001). The reaction producing NDHD that is performed by E. coli is given in Figure 4.4 (Jerina, Daly et al. 1971).

Figure 4.4 Oxidation of naphthalene to (1R,2S)-1,2-naphthalene dihydrodiol (NDHD). The reaction is catalyzed by the enzyme naphthalene dioxygenase (NDO), which is present in E. coli. The reaction requires oxygen (O2) and nicotinamide adenine dinucleotide phosphate (NADPH). Adapted from (Jerina, Daly et al. 1971).

The key properties of interest when designing an ionic liquid to extract NDHD during fermentation are the partition coefficient of NDHD between ionic liquid and water (Kx) and the toxicity of the ionic liquid towards E. coli (EC50). Group contribution models proved unable to successfully correlate the properties of interest to molecular structures; accordingly, connectivity index QSPRs were used. The partition coefficient of NDHD between ionic liquid and water is the ratio of the mole fraction of NDHD in ionic liquid (xIL) to the mole fraction of NDHD in water (xaq), given by Equation 4.3.

$$K_x = \frac{x_{IL}}{x_{aq}}$$

(Equation 4.3)

Toxicity is measured by the half maximal effective concentration (EC50) value, which represents the concentration that is effective in killing half of a given community of organisms. For the system considered here, EC50 represents the overall toxicity of the ionic liquid towards E. coli. Lower values of EC50 represent a more toxic ionic liquid. Experimental partition coefficients were obtained for 10 ionic liquids and experimental toxicity values for 12 ionic liquids (Scurto 2012).

4.2.4 Descriptor selection

Linear property models were developed relating percent monomer remaining after lyophilization to excipient structure on a protein-by-protein basis. To prevent over-fitting, descriptor selection was performed using Mallows' Cp statistic (see Equation 4.4). Conceptually, Mallows' Cp statistic is equal to the lack of fit plus a penalty for the number of descriptors chosen (Wasserman 2004).

$$C_p = \sum_{i=1}^{m} \left( y_i - \hat{y}_i \right)^2 + 2 p \hat{\sigma}^2$$

(Equation 4.4)

where $y_i$ is the observed or experimental value, $\hat{y}_i$ is the predicted value, m is the number of data points, p is the number of parameters or descriptors and $\hat{\sigma}^2$ is the estimate of the residual variance. The value given by $\hat{\sigma}^2$ is an unbiased estimate of the variance (Wasserman 2004), shown below:

$$\hat{\sigma}^2 = \frac{1}{m - p} \sum_{i=1}^{m} \left( y_i - \hat{y}_i \right)^2$$

(Equation 4.5)

When comparing models, the model with the minimal value of Cp is the one that best correlates the data without over-fitting. For a given model size (number of descriptors), an exhaustive search was performed to select the descriptors that minimized Cp. All model sizes were then compared and the model size with the minimal Cp statistic was selected as the final model. Figure 4.5 gives a graphical example of the use of Cp in descriptor selection. For linear correlations, the selection results given by Cp are equivalent to those given by AIC (Akaike Information Criterion) (Wasserman 2004). Accordingly, descriptor selection results may refer to either AIC or Cp, depending on the software package used for selection. Descriptor selection was performed using the leaps package in R (Lumley 2004; Dalgaard 2008). The general procedure used in R for descriptor selection along with sample code is given in Appendix B.

Figure 4.5 Values for Mallows' Cp statistic versus model size (number of connectivity indices used). The lowest value is observed when six connectivity indices are used, indicating the size of the model that should be used. Figure adapted from (Roughton, Topp et al. 2012).
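The exhaustive Cp search described above can be sketched in a few lines. The thesis performed this step with the leaps package in R (see Appendix B); the Python below is only an illustrative stand-in, not the author's code. The function names (`solve`, `residual_ss`, `best_subset_by_cp`) and the data are hypothetical. It assumes that the residual variance is estimated once from the model containing all candidate descriptors, and that p counts descriptors only (the intercept is not counted); other conventions exist.

```python
from itertools import combinations


def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def residual_ss(rows, y):
    """Fit y = b0 + sum(b_j * x_j) by least squares; return the residual sum of squares."""
    design = [[1.0] + list(r) for r in rows]
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(k)]
    coef = solve(xtx, xty)
    fitted = [sum(c * v for c, v in zip(coef, d)) for d in design]
    return sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))


def best_subset_by_cp(X, y):
    """Return (best subset, its Cp, best (Cp, subset) for each model size)."""
    m, n_desc = len(y), len(X[0])
    # Assumption: sigma^2 comes from the full model, with p = number of descriptors.
    sigma2 = residual_ss(X, y) / (m - n_desc)
    best_per_size = {}
    for p in range(1, n_desc + 1):
        candidates = []
        for subset in combinations(range(n_desc), p):
            rows = [[row[j] for j in subset] for row in X]
            cp = residual_ss(rows, y) + 2 * p * sigma2  # lack of fit + penalty
            candidates.append((cp, subset))
        best_per_size[p] = min(candidates)  # best descriptors for this size
    best_cp, best_subset = min(best_per_size.values())  # best size overall
    return best_subset, best_cp, best_per_size


# Hypothetical data: 8 compounds, 3 candidate descriptors per compound.
X = [[1, 5, 3], [2, 1, 8], [3, 4, 1], [4, 2, 6],
     [5, 8, 2], [6, 3, 7], [7, 7, 4], [8, 6, 5]]
noise = [0.1, -0.1, 0.05, -0.05, 0.1, -0.1, 0.05, -0.05]
y = [2 * r[0] - r[2] + e for r, e in zip(X, noise)]
subset, cp, per_size = best_subset_by_cp(X, y)
```

Here `per_size` plays the role of the curve in Figure 4.5 (one minimal Cp per model size), and `subset` holds the column indices of the descriptors at its minimum. The search visits every subset, so its cost doubles with each added candidate descriptor.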

4.2.5 Cross-Validation

Once a final model was selected, leave-one-out cross-validation (LOOCV) was performed to evaluate the predictive ability of the model. In LOOCV, each observation in turn is left out of the data set and the selected descriptors are correlated again to the remaining data. The resulting model is then used to predict the left-out data point. The process is repeated until every observation has been left out once. Upon completion, the predictions are used to calculate the predicted residual sum of squared errors (PRESS) through use of Equation 4.6 (Quan 1988).

$$PRESS = \sum_{i=1}^{m} \left( y_i - \hat{y}_{\langle i \rangle} \right)^2$$

(Equation 4.6)

where $\hat{y}_{\langle i \rangle}$ is the predicted value for the left-out observation i and m is the total number of observations.
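As a minimal sketch of the leave-one-out loop and Equation 4.6, the Python below refits a model m times and accumulates the squared error on each left-out point. For brevity it assumes a hypothetical single-descriptor straight-line model, whereas the models in this chapter use several connectivity indices; the function names and data are illustrative only. The procedure actually used is the R code referenced in Appendix B.

```python
def fit_line(x, y):
    """Ordinary least squares for y = b0 + b1 * x; returns (b0, b1)."""
    m = len(x)
    x_bar = sum(x) / m
    y_bar = sum(y) / m
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


def loocv_press(x, y):
    """PRESS: squared prediction error on each observation while it is left out."""
    press = 0.0
    for i in range(len(x)):
        # Refit the model without observation i.
        x_train = x[:i] + x[i + 1:]
        y_train = y[:i] + y[i + 1:]
        b0, b1 = fit_line(x_train, y_train)
        # Predict the left-out observation and accumulate the squared error.
        press += (y[i] - (b0 + b1 * x[i])) ** 2
    return press


# Hypothetical data: one descriptor value and one property value per compound.
descriptor = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
prop = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]
press = loocv_press(descriptor, prop)
```

Because each prediction is made by a model that never saw the point being predicted, PRESS is never smaller than the residual sum of squares of the model fitted to all the data.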

The PRESS value is then used to calculate the cross-validation coefficient Q² (see Equation 4.7).

$$Q^2 = 1 - \frac{PRESS}{\sum_{i=1}^{m} \left( y_i - \bar{y} \right)^2}$$

(Equation 4.7)

where $\bar{y}$ is the average value for all data points. Q² can be at most equal to the model's R² value, a limit that represents perfect predictive ability (Quan 1988). When comparing models, a smaller difference R² − Q² represents better predictive power. In general, Q² can be calculated for K-fold cross-validation by expanding upon Equation 4.6 to yield Equation 4.8. In K-fold cross-validation, K folds are generated from the original data set. The number of observations left out should be equal for each fold. As K decreases, the predictive power of the model is further strained as fewer observations are used to build the model.

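$$PRESS = \sum_{j=1}^{k} \sum_{i=1}^{n} \left( y_{ij} - \hat{y}_{\langle ij \rangle} \right)^2$$

(Equation 4.8)

where k is the number of folds, n is the number of observations left out in each fold, and $\hat{y}_{\langle ij \rangle}$ is the predicted value for the i-th observation left out in fold j. The general procedure used in R for cross-validation along with sample code is given in Appendix B.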
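The K-fold form of PRESS and the resulting Q² can be sketched as follows. This is a hedged Python illustration rather than the R procedure of Appendix B. It again assumes a hypothetical one-descriptor straight-line model, and it assigns observations to folds in an interleaved pattern, which is one possible choice and not something the text specifies. All names and data are illustrative.

```python
def fit_line(x, y):
    """Ordinary least squares for y = b0 + b1 * x; returns (b0, b1)."""
    m = len(x)
    x_bar, y_bar = sum(x) / m, sum(y) / m
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


def kfold_press(x, y, k):
    """PRESS over k folds, each leaving out n = m / k observations."""
    m = len(x)
    if k < 2 or m % k != 0:
        raise ValueError("k must be at least 2 and divide the number of observations")
    press = 0.0
    for j in range(k):
        # Assumption: interleaved folds, so fold j holds observations j, j + k, ...
        held_out = set(range(j, m, k))
        x_train = [x[i] for i in range(m) if i not in held_out]
        y_train = [y[i] for i in range(m) if i not in held_out]
        b0, b1 = fit_line(x_train, y_train)
        press += sum((y[i] - (b0 + b1 * x[i])) ** 2 for i in held_out)
    return press


def q_squared(press, y):
    """Cross-validation coefficient: 1 - PRESS / total sum of squares."""
    y_bar = sum(y) / len(y)
    return 1.0 - press / sum((yi - y_bar) ** 2 for yi in y)


# Hypothetical data: one descriptor value and one property value per compound.
descriptor = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
prop = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]
q2_loocv = q_squared(kfold_press(descriptor, prop, k=6), prop)   # k = m is LOOCV
q2_3fold = q_squared(kfold_press(descriptor, prop, k=3), prop)   # n = 2 per fold
```

Setting k equal to the number of observations leaves out one point per fold and so reproduces the leave-one-out PRESS. Requiring k to divide m is a simple way to enforce the equal fold sizes described above.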

In document Development of Computer-Aided Molecular Design Methods for Bioengineering Applications (Page 90-95)