5.2.6 Data extraction and statistical analysis

The eluants in the first 1.5 min and during the last 5 min of the HILIC and lipid runs were considered

as waste; the remaining data were extracted and aligned. A similar procedure was performed for

the components eluting between 3 and 14 min from the C18 column using in‐house proprietary

software developed by AgResearch. The resulting peak‐area matrix data were normalised against the

average of all peak areas to remove any effects from batches or run‐orders, as well as to correct for

artefacts from the peak detection, using a number of statistical packages as well as the in‐house

proprietary software.
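As a rough illustration of the normalisation idea, the sketch below scales each sample by the mean of its own peak areas. It is not the in‐house AgResearch software, which is proprietary and not described here. The function name `normalise_by_mean_area` and the assumption that the "average of all peak areas" is taken per sample are illustrative only.

```python
from statistics import mean


def normalise_by_mean_area(matrix):
    """Scale each sample (row) by the mean of its own peak areas.

    `matrix` is a list of rows: one row per sample, one column per peak.
    Assumption: the average is taken per sample; the source does not say
    whether the average is per sample or over the whole matrix.
    """
    normalised = []
    for row in matrix:
        row_mean = mean(row)
        if row_mean == 0:
            raise ValueError("sample has no signal to normalise against")
        normalised.append([area / row_mean for area in row])
    return normalised


# Two hypothetical samples with the same profile but a tenfold
# difference in overall instrument response.
peak_areas = [
    [100.0, 200.0, 300.0],
    [10.0, 20.0, 30.0],
]
normalised = normalise_by_mean_area(peak_areas)
```

After scaling, the two hypothetical samples have identical relative profiles, which is the kind of response difference this step is meant to remove.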

Raw data were subjected to a peak detection process using PhenoAnalyzer (SpectralWorks Ltd,

Manchester, UK) with key parameter settings described previously (Fraser et al., 2014). The

resulting peak data matrices were noisy and required the implementation of a series of quality

control filtering techniques before statistical analyses could be conducted. These filtering steps

included:

1) missing value treatment: the peaks missing in > 90 % of the samples were removed. Any other

missing values were replaced by half of the minimum of the non‐missing values of that peak.

2) peak de‐isotoping and merging: a series of ions, for example [M+H], [M+H]+1 and [M+H]+2 in

positive ionisation mode, was treated as one isotopic series when the ions (a) eluted at the same

retention time (RT) within ± 0.1 min, (b) fell within a ± 1.005 m/z error zone, (c) had peak

intensities that were highly correlated among samples (Pearson's correlation coefficient > 0.9), and

(d) had measured peak intensity ratios [M]/[M+H]+1 > 2.0 and [M+H]+1/[M+H]+2 > 2.0 (the latter only

if peak [M+H]+2 was present; 2.0 is only an empirical threshold). Only the monoisotopic ion [M] of

the series was retained for further quality control analysis.

3) run‐order correction: run‐order effects within each batch were normalised using linear regression (Koulman et al.,

2007). This was to account for changes in the instrument response across batches.

4) batch effect correction: batch effects were corrected using a parametric empirical Bayes method (Johnson et al., 2007),

and, thereafter, the peaks still showing a significant batch effect (F‐test, p‐value < 0.05) were

removed. Peak annotation processes were not relevant for this analysis.
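Filtering steps 1 and 3 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code used in the study. The function names, the use of `None` for a missing peak, and the choice to remove a fitted linear trend while keeping the batch mean are all assumptions. The de‐isotoping step and the parametric empirical Bayes batch correction are not reproduced here.

```python
def filter_and_impute(matrix, max_missing_fraction=0.9):
    """Step 1: drop peaks missing in more than 90 % of samples, then replace
    the remaining gaps with half of the minimum observed value of that peak.

    `matrix` has one row per sample and one column per peak; a missing
    value is represented by None (an assumption of this sketch).
    Returns the cleaned matrix and the indices of the columns kept.
    """
    n_samples = len(matrix)
    kept, fill = [], {}
    for j in range(len(matrix[0])):
        observed = [row[j] for row in matrix if row[j] is not None]
        missing_fraction = 1 - len(observed) / n_samples
        if not observed or missing_fraction > max_missing_fraction:
            continue  # peak removed
        kept.append(j)
        fill[j] = min(observed) / 2
    cleaned = [
        [row[j] if row[j] is not None else fill[j] for j in kept]
        for row in matrix
    ]
    return cleaned, kept


def run_order_correct(intensities, run_order):
    """Step 3 (sketch): fit intensity against run order by least squares
    within one batch and subtract the fitted trend, keeping the batch mean.
    """
    n = len(intensities)
    mean_x = sum(run_order) / n
    mean_y = sum(intensities) / n
    sxx = sum((x - mean_x) ** 2 for x in run_order)
    if sxx == 0:
        return list(intensities)  # no spread in run order, nothing to fit
    slope = sum(
        (x - mean_x) * (y - mean_y) for x, y in zip(run_order, intensities)
    ) / sxx
    return [y - slope * (x - mean_x) for x, y in zip(run_order, intensities)]


# Hypothetical data: three samples, three peaks; the second peak is never seen.
raw = [
    [4.0, None, 1.0],
    [None, None, 3.0],
    [8.0, None, 5.0],
]
cleaned, kept = filter_and_impute(raw)

# Hypothetical peak drifting upwards by 2 units per injection within a batch.
corrected = run_order_correct([10.0, 12.0, 14.0, 16.0], [1, 2, 3, 4])
```

In the first example the empty peak is dropped and the single gap in the first peak is filled with half of that peak's minimum. In the second, the purely linear drift is removed so that every injection sits at the batch mean.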

The data were log transformed prior to any multivariate analyses as they did not follow a normal

distribution. PCA of the raw peak area data showed large variations within and between

batches. Re‐doing the analysis using the normalised (log transformed) data showed that the

After the pre‐processing, a total of 450 m/z_RT pairs were detected within the (‐) ESI

chromatograms and 1278 m/z_RT pairs in the (+) ESI chromatograms. These pairs were used to

create data matrices of peak areas of target ‘features’ for all samples for each ionisation mode. The

matrix tables were produced with the predictor variables (m/z_RT values) in the X‐block and the

response variables (group, day, dosing) in the Y‐block. The resulting matrices were then analysed

using SIMCA (Version 13.0.2.0, Umetrics AB) and RStudio statistical software (Version 0.97.449,

RStudio, Boston, MA, USA). Initially the data were reduced using principal component analysis (PCA),

then analysed further by partial least squares discriminant analysis (PLS‐DA) and orthogonal PLS‐DA

(OPLS‐DA). A brief summary of the processes is given below; see sections 4.2.4 and 4.2.5 for detailed

descriptions of the multivariate statistics and time series methods.

Hotelling's T2 plots, from initial PCA modelling, were used to identify any outliers in the data. The T2

critical (99 %) line represents the 99 % tolerance for the data in each model. A T2 critical line, based

on a 95 % tolerance, is also represented in the scores plot as the ellipse surrounding most samples

(Wheelock & Wheelock, 2013). Any observations situated outside this ellipse or outside the T2

critical line were treated as potential outliers. These observations were checked for their sample ID, and the raw spectra

were examined to try to identify a reason for the deviation. Not all samples outside the T2 tolerances were

excluded from further analysis, as these may represent significant changes associated with the

degree of effect of a sporidesmin challenge. Rather, observations were excluded from further

analysis when the spectra showed poorly resolved peaks, a low signal‐to‐noise ratio, and/or a failed

injection. PLS‐DA and OPLS‐DA models were then produced using defined classes to aid in

separation of the data based on specific observations. For example, samples measured before

dosing (Days ‐14 to 0) were classed separately from those measured after dosing (Days 7 to 42).
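The outlier screen can be illustrated with a small sketch that computes Hotelling's T2 for each observation from its PCA scores and flags those above a critical value. This is not SIMCA's implementation. The scores, the critical value of 3.0 and the function names are hypothetical, and in practice the critical value would come from the modelling software at the chosen tolerance (95 % or 99 %).

```python
from statistics import variance


def hotelling_t2(scores):
    """Return Hotelling's T2 for each observation.

    `scores` has one row per observation and one column per principal
    component. Scores are assumed to be mean-centred, as PCA scores are.
    T2 is the sum over components of score squared / score variance.
    """
    n_components = len(scores[0])
    variances = [
        variance([row[a] for row in scores]) for a in range(n_components)
    ]
    return [
        sum(row[a] ** 2 / variances[a] for a in range(n_components))
        for row in scores
    ]


def flag_outliers(scores, t2_critical):
    """Return the indices of observations whose T2 exceeds the critical value."""
    return [
        i for i, t2 in enumerate(hotelling_t2(scores)) if t2 > t2_critical
    ]


# Hypothetical scores for five samples on two components.
scores = [
    [-1.0, 1.0],
    [-1.0, -1.0],
    [-1.0, 0.0],
    [-1.0, 0.0],
    [4.0, 0.0],
]
t2_values = hotelling_t2(scores)
flagged = flag_outliers(scores, t2_critical=3.0)
```

A flagged index only marks a sample for inspection. As described above, the decision to exclude it rests on the raw spectra (poorly resolved peaks, low signal to noise, failed injection), not on T2 alone.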

RStudio was then used to produce time series analyses. For the time series analyses, plots were

produced showing the m/z_RT pair intensity versus time curves for each cow, to compare the time

profiles across the sample population. The time profile of each m/z_RT pair was extracted for each

cow, and this was used to calculate the average time profile for the groups to which they belonged.

If necessary, missing time points were filled in by linear interpolation for internal points, and by

replicating the nearest data point for missing end points. This information was then used to see

which m/z_RT pairs showed large differences in the shape of the time curves when comparing groups,

while taking into account how much variation there was between cows within each group. A p‐value

and SDA‐rank were produced for each of these, combined, and then ranked to identify which

variables contributed the most to the differences between the samples over time. The p‐values were not
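The gap‐filling rule and the group averaging described in this paragraph can be sketched as follows. The sampling days, the intensities and the function names are hypothetical, and the sketch does not reproduce the R code used for the time series analysis or the p‐value and SDA‐rank calculations.

```python
def fill_time_profile(times, values):
    """Fill missing points (None) in one cow's time profile.

    Internal gaps are filled by linear interpolation between the nearest
    observed neighbours; gaps at either end copy the nearest observed value.
    """
    known = [i for i, v in enumerate(values) if v is not None]
    if not known:
        raise ValueError("profile has no observed values")
    first, last = known[0], known[-1]
    filled = list(values)
    for i, v in enumerate(values):
        if v is not None:
            continue
        if i < first:
            filled[i] = values[first]
        elif i > last:
            filled[i] = values[last]
        else:
            lo = max(k for k in known if k < i)
            hi = min(k for k in known if k > i)
            weight = (times[i] - times[lo]) / (times[hi] - times[lo])
            filled[i] = values[lo] + weight * (values[hi] - values[lo])
    return filled


def group_mean_profile(profiles):
    """Average the filled profiles of all cows in a group, time point by time point."""
    return [sum(column) / len(column) for column in zip(*profiles)]


# Hypothetical sampling days and intensities for one m/z_RT pair.
days = [-14, -7, 0, 7, 14, 21]
cow_a = fill_time_profile(days, [None, 2.0, None, 6.0, 8.0, None])
cow_b = fill_time_profile(days, [4.0, 4.0, 4.0, 4.0, 4.0, 4.0])
group_profile = group_mean_profile([cow_a, cow_b])
```

Interpolating on the actual sampling days rather than on the sample index keeps the filled values sensible when the time points are unevenly spaced.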

In document The search for biomarkers of facial eczema, following a sporidesmin challenge in dairy cows, using mass spectrometry and nuclear magnetic resonance of serum, urine, and milk : a thesis presented in partial fulfilment of the requirements for the degree of (Page 160-162)