control weeks for all groups
5.2.6 Data extraction and statistical analysis
The eluants in the first 1.5 min and during the last 5 min of the HILIC and lipid runs were considered
as waste; the remaining data were extracted and aligned. A similar procedure was performed for
the components eluting between 3 and 14 min from the C18 column using in‐house proprietary
software developed by AgResearch. The resulting peak‐area matrix data were normalised against the
average of all peak areas to remove any effects from batches or run‐orders, as well as to correct for
artefacts from the peak detection, using a number of statistical packages as well as the in‐house
proprietary software.
Raw data were subjected to peak detection using PhenoAnalyzer (SpectralWorks Ltd,
Manchester, UK) with key parameter settings described previously (Fraser et al., 2014). The
resulting peak data matrices were noisy, and necessitated implementation of a series of quality
control filtering techniques before statistical analyses could be conducted. These filtering steps
included:
1) missing value treatment: the peaks missing in > 90 % of the samples were removed. Any other
missing values were replaced by half of the minimum of the non‐missing values of that peak.
2) peak de‐isotoping and merging: a series of ions, for example [M+H], [M+H]+1 and [M+H]+2 in
positive ionisation mode, was treated as one isotope series if the ions (a) eluted at the same
retention time (RT) within ± 0.1 min, (b) fell within a ± 1.005 m/z error zone, (c) had peak
intensities that were highly correlated among samples (Pearson's correlation coefficient > 0.9), and
(d) had measured peak intensity ratios [M]/[M+H]+1 > 2.0 and [M+H]+1/[M+H]+2 > 2.0 (if peak
[M+H]+2 was present; 2.0 is only an empirical threshold). Only the monoisotopic ion [M] of the
series was retained for further quality control analysis.
3) run‐order correction: run‐order effects within each batch were corrected using linear regression
(Koulman et al., 2007). This was to account for changes in the instrument response across batches.
4) batch effect correction: batch effects were corrected using a parametric empirical Bayes method
(Johnson et al., 2007),
and, thereafter, the peaks still showing significant batch effect (F‐test, p‐values < 0.05) were
removed. Peak annotation processes were not relevant for this analysis.
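Filtering steps 1 and 3 above can be illustrated with a short sketch. This is not the in‐house software or the exact procedure used in this work; the function names, the dictionary layout of the peak matrix, and the choice to re‐centre the corrected values on the batch mean are assumptions made only for illustration.

```python
from statistics import mean


def filter_and_impute(peaks, max_missing_fraction=0.9):
    """Step 1 (sketch): drop sparse peaks, then fill the remaining gaps.

    `peaks` is a hypothetical layout: peak label -> list of per-sample
    peak areas, with None marking a missing value.
    """
    cleaned = {}
    for label, areas in peaks.items():
        observed = [a for a in areas if a is not None]
        n_missing = len(areas) - len(observed)
        if n_missing > max_missing_fraction * len(areas):
            continue  # peak missing in > 90 % of samples: remove it
        fill = min(observed) / 2  # half the minimum non-missing value
        cleaned[label] = [fill if a is None else a for a in areas]
    return cleaned


def correct_run_order(areas, run_order):
    """Step 3 (sketch): remove a linear run-order trend within one batch.

    Fits area = intercept + slope * run_order by ordinary least squares and
    subtracts the slope term relative to the mean run position, so that the
    batch mean is preserved (an assumption, not stated in the text).
    """
    x_bar, y_bar = mean(run_order), mean(areas)
    sxx = sum((x - x_bar) ** 2 for x in run_order)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(run_order, areas))
    slope = sxy / sxx
    return [y - slope * (x - x_bar) for x, y in zip(run_order, areas)]
```

In this sketch the run‐order correction would be applied to one peak at a time within a single batch, after the missing values have been filled.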
The data were log transformed prior to any multivariate analyses as they did not follow a normal
distribution. PCA of the raw peak area data showed large variations within and between
batches. Re‐doing the analysis using the normalised (log transformed) data showed that the
After the pre‐processing, a total of 450 m/z_RT pairs were detected within the (‐) ESI
chromatograms and 1278 m/z_RT pairs in the (+) ESI chromatograms. These pairs were used to
create data matrices of peak areas of target ‘features’ for all samples for each ionisation mode. The
matrix tables were produced with the predictor variables (m/z_RT values) in the X‐block and the
response variables (group, day, dosing) in the Y‐block. The resulting matrices were then analysed
using SIMCA (Version 13.0.2.0, Umetrics AB) and RStudio statistical software (Version 0.97.449,
RStudio, Boston, MA, USA). Initially the data were reduced using principal component analysis (PCA),
followed by partial least squares discriminant analysis (PLS‐DA) and orthogonal PLS‐DA
(OPLS‐DA). A brief summary of the processes is given below; see sections 4.2.4 and 4.2.5 for detailed
descriptions of the multivariate statistics and time series methods.
Hotelling's T2 plots, from initial PCA modelling, were used to identify any outliers in the data. The T2
critical (99 %) line represents the 99 % tolerance for the data in each model. A T2 critical line based
on a 95 % tolerance is also represented in the scores plot as the ellipse surrounding most samples
(Wheelock & Wheelock, 2013). Any observations situated outside this ellipse or outside the T2
critical line were considered outliers. These observations were checked for their sample ID, and the
raw spectra were examined to try to identify a reason for the deviation. Not all samples outside the
T2 tolerances were excluded from further analysis, as these may represent significant changes
associated with the degree of effect of a sporidesmin challenge. Rather, observations were excluded
from further analysis when the spectra showed poorly resolved peaks, a low signal‐to‐noise ratio, and/or a failed
injection. PLS‐DA and OPLS‐DA models were then produced using defined classes to aid in
separation of the data based on specific observations. For example, samples measured before dosing
(Days ‐14 to 0) were classed separately from those measured after dosing (Days 7 to 42).
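The outlier screen described above can be sketched as follows. This is only an illustration and not the SIMCA implementation: it assumes the PCA scores have already been computed, uses the common definition of Hotelling's T2 as the sum of squared scores divided by the variance of each score vector, and takes the critical value as an input (in practice it would be derived from an F‐distribution at the chosen tolerance). The function names are hypothetical.

```python
from statistics import variance


def hotelling_t2(scores):
    """Return Hotelling's T2 for each observation.

    `scores` is a list of rows, one per observation, holding the PCA score
    on each retained component. Scores are assumed to be mean-centred,
    as they are when PCA is run on centred data.
    """
    n_components = len(scores[0])
    # Sample variance of each score vector (one value per component).
    score_var = [variance([row[a] for row in scores]) for a in range(n_components)]
    return [
        sum(row[a] ** 2 / score_var[a] for a in range(n_components))
        for row in scores
    ]


def flag_outliers(scores, t2_critical):
    """Indices of observations whose T2 exceeds the supplied critical value."""
    return [i for i, t2 in enumerate(hotelling_t2(scores)) if t2 > t2_critical]
```

As in the text, a flagged observation is only a candidate for exclusion; the decision to remove it would still rest on inspection of the raw spectra.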
RStudio was then used to produce time series analyses. For the time series analyses, plots were
produced showing the m/z_RT pair intensity versus time curves for each cow, to compare the time
profiles across the sample population. The time profile of each m/z_RT pair was extracted for each
cow, and this was used to calculate the average time profile for the groups to which they belonged.
If necessary, missing time points were substituted by linear interpolation for internal points, and
by replication of the nearest data point for missing end points. This information was then used to
identify which m/z_RT pairs showed large differences in the shape of the time curves when comparing
groups, while taking into account how much variation there was between cows within each group. A
p‐value and SDA‐rank were produced for each of these, combined, and then ranked to identify which
variables contributed the most to the differences between the samples over time. The p‐values were not