3. Materials and Methods
3.3.5 Classification SAR Modelling Methods
The limited number of data samples is an unavoidable problem with metal oxide NM databases, and this restriction makes it challenging to build a reliable genotoxicity model with high prediction accuracy. Even when a large number of molecular descriptors is calculated for a small NM data set, we still face “undersampling-induced collinearity”, i.e. a high degree of collinearity among the descriptors 146,147. Collinearity will be present in the model because the number of samples is very small compared with the number of descriptors. In addition, problems such as over-fitting and noise in the data will arise and degrade the model. Given these complications, it is better to restrict the search for the most appropriate model to a limited set of hypotheses. In other words, with small data it is better to start from a small set of candidate hypotheses, e.g. a set of decision trees with a depth of at most four. We therefore opted for a simple tree classification analysis for (Q)SAR modelling of our data set; specifically, the Recursive Partitioning and Regression Trees (rpart) model was used to classify the data set.
3.3.5.1 Recursive Partitioning and Regression Trees
The rpart programs build classification and regression models in two phases, and the result is a binary tree. In the first phase, the algorithm identifies the variable that contributes most to splitting the data into two groups. After the data have been divided, the algorithm continues the splitting separately within each group. The procedure continues recursively until each group contains a minimum number of samples or no further improvement can be achieved. In the second phase, cross-validation is performed on the data to trim the full tree and make it simpler 148.
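As a rough illustration of the first phase described above, the following Python sketch grows a small binary classification tree by recursive partitioning. It is not the rpart implementation: the use of Gini impurity as the split criterion, the parameter names max_depth and min_split, and the six toy samples with two standardized descriptors are all assumptions made for the example, and the cross-validated pruning of the second phase is omitted.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (assumed split criterion)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(rows, labels):
    """Return (gain, feature_index, threshold) for the best split, or None."""
    n = len(rows)
    parent = gini(labels)
    best = None
    for j in range(len(rows[0])):
        values = sorted(set(r[j] for r in rows))
        # Candidate thresholds: midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2.0
            left = [y for r, y in zip(rows, labels) if r[j] < t]
            right = [y for r, y in zip(rows, labels) if r[j] >= t]
            child = (len(left) * gini(left) + len(right) * gini(right)) / n
            gain = parent - child
            if gain > 1e-12 and (best is None or gain > best[0]):
                best = (gain, j, t)
    return best


def grow(rows, labels, depth=0, max_depth=4, min_split=2):
    """Split recursively until a stopping rule applies (hypothetical parameters)."""
    majority = Counter(labels).most_common(1)[0][0]
    if depth >= max_depth or len(rows) < min_split:
        return {"leaf": majority}
    split = best_split(rows, labels)
    if split is None:  # no split improves purity
        return {"leaf": majority}
    _, j, t = split
    left = [i for i, r in enumerate(rows) if r[j] < t]
    right = [i for i, r in enumerate(rows) if r[j] >= t]
    return {
        "feature": j,
        "threshold": t,
        "left": grow([rows[i] for i in left], [labels[i] for i in left],
                     depth + 1, max_depth, min_split),
        "right": grow([rows[i] for i in right], [labels[i] for i in right],
                      depth + 1, max_depth, min_split),
    }


def predict(tree, row):
    """Follow the splits down to a leaf and return its class label."""
    while "leaf" not in tree:
        branch = "left" if row[tree["feature"]] < tree["threshold"] else "right"
        tree = tree[branch]
    return tree["leaf"]


# Toy data (made up for illustration): two descriptors per sample,
# "+" = genotoxic, "-" = not genotoxic.
rows = [[-1.2, 0.3], [-0.8, -0.5], [-0.1, 1.1],
        [0.4, -0.9], [0.9, 0.2], [1.5, -0.4]]
labels = ["-", "-", "-", "+", "+", "+"]

tree = grow(rows, labels, max_depth=4)
print(tree)
```

On this toy set a single split on the first descriptor already separates the two classes, so the recursion stops after one level; with real descriptors the depth limit and the minimum group size are what keep the hypothesis space small.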
Given the small data set of metal oxide NMs, the associated set of quantum-mechanical descriptors and the classification endpoint we need to model, “randomness” is likely to play a role in the built model. To address this, we decided to develop a model that analyses the importance of each variable in relation to the genotoxicity of the NMs, rather than a model that estimates the genotoxicity of the metal oxide NPs. In addition to their predictive ability, (Q)SAR models help to identify the physico-chemical attributes of a chemical that are most relevant to the toxicological and biological properties of the substance. In the present study, (Q)SAR models are employed to study the effect of each quantum-chemical descriptor in amplifying or reducing the genotoxicity of the NMs. Considering the limitations mentioned above, we used all the data as the training set and studied the importance of each descriptor in amplifying the genotoxicity of the metal oxide NPs. All quantum-chemical descriptors were standardized prior to the modelling process. All analyses were done in R version 3.2.3 (R Foundation for Statistical Computing, Vienna, Austria), using the ‘rpart’ library.
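The standardization step can be sketched as follows. The text does not state which scaling was applied, so this example assumes z-score scaling (subtract the column mean, divide by the sample standard deviation); the function name standardize and the descriptor values are made up for illustration.

```python
from statistics import mean, stdev


def standardize(columns):
    """Z-score each descriptor column (assumed form of standardization)."""
    scaled = {}
    for name, values in columns.items():
        m, s = mean(values), stdev(values)  # stdev uses the n - 1 denominator
        if s == 0:
            raise ValueError(f"descriptor {name!r} is constant")
        scaled[name] = [(v - m) / s for v in values]
    return scaled


# Made-up descriptor values for four hypothetical oxide clusters (eV).
descriptors = {
    "HOMO": [-9.1, -8.4, -10.2, -7.9],
    "LUMO": [-1.0, -2.3, -0.4, -1.7],
}

scaled = standardize(descriptors)
for name, values in scaled.items():
    print(name, [round(v, 3) for v in values])
```

After scaling, every descriptor has mean zero and unit standard deviation, which puts descriptors reported in different units on a common scale.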
Table 4. Criteria for the usefulness and quality assessment of the data set for the (Q)SAR modelling: extent of Comet assay conditions checklist. General parameters have been used to assess each data point and the results are reported in Table S1.A (Appendices) where all questions are answered in a yes or no fashion.
General parameters (with further details to assess)
Comet protocol type:
    I) The pH of unwinding: alkaline, neutral, very alkaline.
    II) Incubation with the enzymes: FPG, 8oxodG, Endo III.
Concentrations expressed in at least one of the units:
    I) Mass per volume, per area, per cell (µg/ml, µg/cm², µg/cell)
    II) Number of NMs per ml, per cm², per cell (ENMs/ml or ENMs/cm² or ENMs/cell)
    III) Surface area per ml, per cm², per cell (cm²/ml or cm²/cm² or cm²/cell)
Cytotoxicity tests performed?
Performed trend test for dose-response relationship?
Microscopic analysis in the Comet assay:
    Analyzed at least 50 Comets per gel divided on two different slides (parallel gels per sample)?
    Comet count performed at least by one of the methods?
    I) % DNA in the tail
    II) Tail length
    III) Tail moment
    IV) Tail intensity (classified as belonging to one of five classes depending on their tail intensity?)
At least 3 hours for treatment time was respected?
Performed comparison between treated samples and controls?
    I) Positive control
    II) Negative control
    III) Both negative and positive controls
Information on uptake (demonstrated cellular
Table 5. Comet assay experimental results for all selected metal oxide nanomaterials used for (Q)SAR modelling*.
No   Metal oxide   Number of genotoxic reports   Number of non-genotoxic reports   Overall assessment**
1    Al2O3         1                             1                                 +
2    NiO           1                                                               +
3    Co3O4         2                                                               +
4    CuO           6                             2                                 +
5    Fe2O3         1                             5                                 -
6    Fe3O4         6                             3                                 +
7    TiO2          32                            6                                 +
8    ZnO           16                            1                                 +
9    SiO2          3                             9                                 -
10   V2O3          1                                                               +
11   V2O5                                        1                                 -
12   MgO                                         1                                 -
13   ZrO2                                        1                                 -
14   CeO2          5                             1                                 +
15   Bi2O3         1                                                               +
16   SnO2                                        1                                 -
* Data were extracted from 128.
** The “positive” and “negative” signs are assigned according to the number of genotoxic and nongenotoxic “reports” per each NM. The assessment column represents the variable used to model, based upon the global evaluation (weight of evidence) of all the reports related to a single NM (i.e. row): “+” means positive, i.e. genotoxic, whereas “-” means negative, i.e. not genotoxic.
Table 6. Acronyms, short definitions and units of the molecular descriptors calculated by MOPAC2012.
Symbol     Descriptor                                                               Unit
HF         Heat of formation                                                        kcal/mol
TE         Total energy of the oxide cluster                                        eV
EE         Electronic energy of the oxide cluster                                   eV
Core       Core-core repulsion energy of the oxide cluster                          eV
COSMO      Surface charge distribution based on Conductor-like Screening Model      Cubic Angstroms
COSMO-SA   Area of the oxide cluster calculated based on COSMO                      Square Angstroms
IP         Ionization Potential                                                     eV
HOMO       Energy of the highest occupied molecular orbital of the oxide cluster    eV
LUMO       Energy of the lowest unoccupied molecular orbital of the oxide cluster   eV
No.Fl      Number of Filled Levels                                                  dimensionless
3.4 Weight of Evidence Approach in the Analysis of Results of Different In Silico Methods