Chapter 2 Materials and Methods
2.4 Missing data
Exposure to an extreme environment, such as the conditions experienced during CXE 2007, stimulates multiple adaptive response pathways. These pathways communicate and interact with one another to produce a complex systems response to stress. A dataset that tries to capture the complexity of such a response is likely to contain missing values. These may arise because certain samples or a specific technique were not available for measurement (due to illness or a temporary machine failure), because of human error during experiments, or because measurements fell above or below the quantifiable limits.
An incomplete dataset can become a problem for further analysis. Some analyses will not work with missing values, and because biological experiments usually contain a relatively small number of samples, leaving one sample out of the whole analysis because of a single missing datapoint could bias the analysis or discard vital information, leading to erroneous conclusions.
2.4.1 Treatment of incomplete datasets
If a dataset is missing values, there are two options available, depending on why the values are missing. If a measured sample gives a value above or below the quantifiable limit of the equipment, some potentially useful information can still be gained from this datapoint. If the sample is below the detection limit of the equipment, it is very unlikely that the sample is completely empty of the material being tested for (samples are rarely completely empty of a particular metabolite). The simplest solution is to impute a value half-way between zero and the lower limit of the equipment being used. This way, the sample still contributes some information that may be useful to the analysis.
If samples are missing, it is more difficult to impute their values accurately. Individual values can be estimated by assessing how that individual responds in comparison to all other individuals, and imputing the missing value based on that individual’s average ranking within the group. For example, if the individual ranks highly in the group for a particular metabolite, an assumption could be made that the individual ranks similarly at all time points, and an appropriate value could be calculated from similarly-responding individuals. However, this may not be appropriate in some cases, as it relies heavily on the assumption that the individual responds in a similar way to the other individuals in the group.
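The half-limit rule described above can be sketched in a few lines of Python. This is only an illustration, not the procedure used in the project: the function name, the BELOW_LIMIT marker and the example readings are hypothetical, and only the IL-13 detection limit (0.63 pg/ml) is taken from Table 2.5. Readings from samples that were never taken (None) are deliberately left alone, since the rule applies only to out-of-range measurements.

```python
# Hypothetical marker for a reading reported as below the detection limit.
BELOW_LIMIT = "below"


def impute_half_limit(readings, lower_limit):
    """Replace below-limit readings with half of the lower detection limit.

    Measured values and missing samples (None) are returned unchanged.
    """
    imputed = []
    for reading in readings:
        if reading == BELOW_LIMIT:
            # Half-way between zero and the lower detection limit.
            imputed.append(lower_limit / 2)
        else:
            imputed.append(reading)
    return imputed


# Illustrative readings only; 0.63 pg/ml is the IL-13 limit from Table 2.5.
il13 = impute_half_limit([1.2, BELOW_LIMIT, None, 0.9], lower_limit=0.63)
```

Keeping the below-limit marker separate from None preserves the reason a value is absent, which matters when deciding how to treat it.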
There are several missing data points in both the biochemical and physiological datasets used within this project. Due to the extreme conditions of CXE 2007, some individuals became ill during the expedition, and did not have samples taken at particular time points. A small number of individuals suffered severely from hypoxia and had to be evacuated from the mountain at higher altitudes. The biochemical measurements contained more missing values than the physiological data, mainly due to out-of-range measurements, as there is a higher level of error associated with measuring metabolites that are present in very low concentrations. Measurements taken at higher altitudes also contain more missing data points than measurements taken at lower altitudes. It is important to know why a particular data point is missing, as this can affect the assessment of the data, especially if it is due to an individual being ill. Some metabolites are also missing for entire altitudes. Because of this, several different versions of the dataset have been created, each useful for answering different questions about the data. The biochemical metabolites that had values imputed are shown in Table 2.5.
Biochemical metabolite    Detection limit of technique used    Value imputed    Units
IL-13                     0.63                                 0.315            pg/ml
Eotaxin                   1.69                                 0.845            pg/ml
CRP                       60.00                                30.000           ng/ml
Adrenaline                0.12                                 0.060            fmol/ml
Noradrenaline             0.08                                 0.040            fmol/ml
Adipsin                   422.00                               211.000          pg/ml
EPO                       0.40                                 0.200            mIU/ml
Table 2.5: List of imputed values for biochemical metabolites measured in plasma samples taken during CXE 2007. All values were below the detectable limit of the technique used, and were imputed as half-way between the lower detection limit of the technique used and zero.
2.4.2 How missing values were treated in each dataset
For the core dataset, physiological measurements missing only a small number of values were imputed, to limit the bias that missing values might introduce into further analysis. This was done systematically, by highlighting the individual on a box and whisker plot showing the behaviour of the variable over all altitudes, so that the individual's response at known time points could be compared with that of the rest of the group. The individual was then ranked within the group, depending on their response. If this rank was similar across altitudes, a mean rank was calculated. The missing value was then imputed based on this rank: for example, if the mean rank was 2, the values for the individuals ranked 1 and 2 were added and the sum divided by two to calculate a value in between them. If the rank varied too much within the group, a value could not be imputed this way. This method assumes that an individual’s behaviour remains relative to the way the rest of the group is behaving, and that this relationship is the same across altitudes.
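The rank-based procedure can be approximated programmatically, as in the minimal sketch below. It is not the implementation used here, where ranks were assessed from box and whisker plots. Several details are assumptions made purely for illustration: rank 1 is taken to be the smallest value, "similar" ranks are defined by a hypothetical max_rank_spread threshold, the mean rank is rounded to the nearest integer, and a value is not imputed when the mean rank has no neighbour on both sides. The altitude labels, individuals and values are invented.

```python
from statistics import mean


def rank_in_group(values, person):
    """1-based rank of `person` among non-missing values (assumed 1 = smallest)."""
    present = sorted(v for v in values.values() if v is not None)
    return present.index(values[person]) + 1


def impute_by_rank(data, person, target, max_rank_spread=1):
    """Impute `person`'s value at altitude `target` from their mean rank elsewhere.

    `data` maps altitude -> {individual: value or None}.
    Returns None when the rank is too variable or cannot be placed.
    """
    ranks = [
        rank_in_group(values, person)
        for altitude, values in data.items()
        if altitude != target and values.get(person) is not None
    ]
    # Assumed stability rule: ranks may differ by at most max_rank_spread.
    if not ranks or max(ranks) - min(ranks) > max_rank_spread:
        return None
    k = round(mean(ranks))
    # Values of the rest of the group at the altitude with the missing value.
    others = sorted(
        v for p, v in data[target].items() if p != person and v is not None
    )
    if k < 2 or k > len(others):
        return None  # no pair of neighbouring ranks to average
    # Mean rank k: average the individuals ranked k-1 and k, as in the text.
    return (others[k - 2] + others[k - 1]) / 2


# Invented example: individual "b" is missing at altitude "A3".
example = {
    "A1": {"a": 10, "b": 20, "c": 30, "d": 40},
    "A2": {"a": 12, "b": 22, "c": 33, "d": 41},
    "A3": {"a": 15, "b": None, "c": 36, "d": 44},
}
imputed_b = impute_by_rank(example, "b", "A3")
```

In this example "b" ranks second at both known altitudes, so the imputed value is the average of the two lowest remaining values at "A3".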
The diary dataset was collected on each day of the expedition, and has many missing data values. No information has been imputed for this dataset, as too many data values are missing, and any imputation would risk biasing any analysis carried out on this dataset as it would be mainly based on estimated data points.
2.4.3 Core team datasets
Biochemical metabolites and physiological measurements that are missing some data points are still potentially valuable for analysis. The metabolites had several missing values, ranging from one complete altitude missing (24 missing data values) for all of the glutathione metabolites, to six altitudes missing data for IL-10 (approximately 144 missing data values). These metabolites can provide information on how the group are responding, but cannot be used in model building.
Due to these limitations, two separate datasets for the core team were formed. The first dataset contained only complete information (i.e. no missing data points for any variable included in the dataset), whereas the second dataset contained all measurements that included enough information to be useful (i.e. they contained at least a full altitude worth of data). The complete dataset was primarily used for modelling analysis, as well as any analysis that could not be performed on data with missing values. The incomplete dataset was primarily used for exploratory data analysis, looking at general patterns of expression within the group.
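The split into two datasets can be illustrated with a short sketch. It assumes that "at least a full altitude worth of data" means at least one altitude at which every individual has a value, which is one possible reading of the criterion. The function names, variable names and values are hypothetical and do not correspond to the project data.

```python
def has_no_missing(by_altitude):
    """True if a variable has a value for every individual at every altitude."""
    return all(v is not None for values in by_altitude.values() for v in values)


def has_full_altitude(by_altitude):
    """True if at least one altitude has a value for every individual."""
    return any(
        len(values) > 0 and all(v is not None for v in values)
        for values in by_altitude.values()
    )


def split_datasets(measurements):
    """Return (complete, exploratory) subsets of variable -> altitude -> values."""
    complete = {
        name: data for name, data in measurements.items() if has_no_missing(data)
    }
    exploratory = {
        name: data for name, data in measurements.items() if has_full_altitude(data)
    }
    return complete, exploratory


# Invented toy data: two individuals measured at two altitudes.
toy = {
    "var_a": {"alt1": [1.0, 2.0], "alt2": [1.5, 2.5]},
    "var_b": {"alt1": [1.0, 2.0], "alt2": [None, 2.5]},
    "var_c": {"alt1": [None, 2.0], "alt2": [None, None]},
}
complete, exploratory = split_datasets(toy)
```

Here only var_a would enter the complete dataset, var_a and var_b would enter the exploratory dataset, and var_c would be left out of both.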
Ten individuals were missing all physiological measurements at Everest Base Camp weeks 6 and 8. Two individuals were also missing plasma samples for Everest Base Camp weeks 6 and 8, as they became very ill during the expedition and had to be moved down to a lower altitude to recover. The main focus of this project was to assess the changes that occur with increasing hypoxic exposure; therefore, only measurements from London to Everest Base Camp week 1 were used for modelling. This was because Everest Base Camp weeks 6 and 8 showed altitude-independent changes, and were missing too much information to be used reliably for modelling purposes. Full lists of the biochemical metabolites and physiological measurements available for the core dataset are shown in Tables 2.3 and 2.4.
2.4.4 Variables omitted from the analysis
Some of the biochemical metabolites and physiological measurements were missing too many values to be imputed with any accuracy, and did not appear to give any information about any changes that occurred with increasing altitude. For example, IL-1α only had three samples out of 168 that were within the detectable limits of the BioPlex assay, used to measure multiple biochemical metabolites in the plasma samples. These biochemical metabolites have been removed from all analyses, and are as follows:
• Interleukin-1α
• Interleukin-4
• Interferon-