3.6 Discussion
4.1.1 Challenge of Microbiome Data
The extreme sparsity of microbiome datasets is one of the major challenges in data analysis. Indeed, it is not unusual for over 90% of the counts in these data to be zero, as they contain a large number of rare taxa observed in as few as 1 to 5% of samples. However, recent microbiome quality control studies show that the majority of rare taxa are caused by contamination and/or sequencing errors. Potential sources of contamination are bacteria that are frequently handled in the lab, those that reside on the skin of lab workers, and those present in the extraction kits [Salter et al., 2014]. Several studies have been conducted using ‘mock’ samples curated so that they consist of known microbial species in prescribed proportions; after cultivation, the samples are sequenced using NGS technology to identify the taxa and evaluate the effects of such contamination on the observed taxa counts [Brooks et al., 2015]. Errors, especially those due to misclassification, arise because the sequencing technology employs a combination of statistical and computational algorithms that make assumptions when identifying nucleotide bases [Cacho et al., 2015] and when assembling the DNA fragments during the alignment process [Li and Homer, 2010]. Overall, contamination and sequencing errors lead to either falsely identifying taxa that were not in the sample or misclassifying the taxa of DNA fragment reads. The most common approach to address this problem is filtering, that is, removing spurious taxa from the 16S data set, typically by means of a simple but ad hoc procedure. For example, one of the most widely used filtering techniques in microbiome studies retains only taxa that have counts above m = 0 in at least n samples.
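The traditional filtering rule described above can be sketched in a few lines. This is a minimal illustration rather than the code of any particular package: the function name `prevalence_filter`, the list-of-rows layout of the taxa table, and the toy counts are all assumptions made for the example.

```python
def prevalence_filter(counts, taxa, m=0, n=2):
    """Keep taxa whose count is strictly above m in at least n samples.

    counts: list of rows, one per sample; each row holds one count per taxon.
    taxa:   taxon names, in the same order as the columns of counts.
    (Hypothetical helper; the data layout is an assumption of this sketch.)
    """
    kept = []
    for j, name in enumerate(taxa):
        # Number of samples in which taxon j exceeds the count threshold m.
        n_above = sum(1 for row in counts if row[j] > m)
        if n_above >= n:
            kept.append(name)
    return kept


# Toy taxa table: 4 samples (rows) by 3 taxa (columns); values are made up.
toy_counts = [
    [12, 0, 3],
    [7, 0, 0],
    [0, 1, 5],
    [9, 0, 2],
]
toy_taxa = ["taxon_A", "taxon_B", "taxon_C"]

# taxon_B is non-zero in a single sample only, so it is dropped for n = 2.
kept_taxa = prevalence_filter(toy_counts, toy_taxa, m=0, n=2)
```

The rule treats each taxon in isolation and depends entirely on the user-chosen thresholds m and n, which is why it is described as ad hoc.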
In practice, it is often of interest to use taxa as covariates to predict disease outcomes and understand their association with host health. Examples include predicting small intestinal bacterial overgrowth (SIBO) using taxa sequenced from the intestine [Leite et al., 2019], testing whether dietary interventions shape gut microbiota [Albenberg et al., 2012], and understanding the impact of a probiotic intervention on the composition of the human microbiota [Lahti et al., 2013]. However, in a high-dimensional setting (a large number of variables), it is challenging to find a few important predictors [Fan and Lv, 2008]. Indeed, when the dimensionality p is high, computational cost and prediction accuracy are the two top concerns for any statistical procedure, especially in the presence of extreme sparsity. Hence, dimension reduction for sparse data is often recommended: it reduces the computational burden by identifying the subset of important predictors, and it improves estimation accuracy by allowing well-developed lower-dimensional methods to be used.
In the microbiome setting, contaminant and rare taxa may be considered unimportant predictors. Although several techniques have been proposed to detect and remove them, the literature in this research area is scarce. One approach, developed by [Knights et al., 2011] and implemented in the R package sourcetracker, relies on microbial source tracking to identify the proportion of contaminant taxa in each sample by matching the taxa table to a database of known contaminants.
However, this method does not detect the individual contaminant taxa that should be removed from the data set. [Davis et al., 2018] addressed this problem by introducing the decontam R package, which identifies contaminants by (1) exploiting the inverse correlation between taxa frequencies and sample DNA concentrations; and (2) using the prevalence of taxa in sequenced negative controls [Salter et al., 2014]. A major practical limitation of this method is that the required auxiliary data, namely DNA quantitation measurements (in most cases intrinsic to sample preparation) or negative control data, might not be available.
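The frequency-based signal in (1) can be illustrated with a simple correlation on the log scale. The sketch below is not the statistic computed by decontam; it only shows the underlying intuition that a contaminant's relative frequency tends to fall as the total DNA concentration of a sample rises. The function name and the toy values are assumptions made for the example.

```python
import math


def log_log_correlation(frequencies, dna_concentrations):
    """Pearson correlation between log taxon frequency and log DNA concentration.

    Samples where either value is not positive are skipped, since the
    logarithm is undefined there. Returns None if fewer than three usable
    samples remain or if either variable is constant.
    (Hypothetical helper; a simplification, not decontam's actual test.)
    """
    pairs = [
        (math.log(c), math.log(f))
        for f, c in zip(frequencies, dna_concentrations)
        if f > 0 and c > 0
    ]
    if len(pairs) < 3:
        return None
    mean_x = sum(x for x, _ in pairs) / len(pairs)
    mean_y = sum(y for _, y in pairs) / len(pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    if sxx == 0 or syy == 0:
        return None
    return sxy / math.sqrt(sxx * syy)


# Made-up DNA concentrations for four samples (arbitrary units).
toy_concentration = [1.0, 2.0, 4.0, 8.0]
# A contaminant-like taxon: frequency halves each time concentration doubles.
contaminant_like = [0.4, 0.2, 0.1, 0.05]
# A taxon whose frequency is roughly unrelated to concentration.
sample_like = [0.10, 0.12, 0.09, 0.11]

r_contaminant = log_log_correlation(contaminant_like, toy_concentration)
r_sample = log_log_correlation(sample_like, toy_concentration)
```

A strongly negative correlation is the pattern expected of a contaminant, whereas a value near zero is expected of a taxon truly present in the samples. Without DNA concentration measurements the signal cannot be computed at all, which is the practical limitation noted above.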
Recently, [Smirnova et al., 2018a] introduced a filtering loss measure and a principled filtering test, PERFect, for deciding which taxa to remove. In contrast to the standard procedures, which assume that taxa in a biological network are isolated, PERFect filters out taxa with an insignificant contribution to the total covariance. This method relies on ranking taxa by importance, measuring their contribution to the total covariance, and quantifying the chance that the increase in loss for a set of filtered taxa is due to randomness. The two principled filtering methods, the simultaneous and permutation algorithms, rely on estimating the null distribution of the increase in filtering loss due to each taxon. The simultaneous approach fits a single distribution to the filtering loss differences of all taxa, whereas the permutation approach generates, for each taxon, a distribution of k permuted filtering loss differences and fits it separately. Although the PERFect permutation method was shown to be both statistically rigorous and highly effective, its computational intensity was a major limitation of our initial software implementation. Here, I introduce a fast implementation of the permutation PERFect method that efficiently selects a small subset of taxa for which to build the distribution necessary to assess a taxon’s significance. This subset is selected using an unbalanced binary search algorithm [Morin, 2013] that optimally finds the set of taxa to be removed without having to build the permutation distribution and compute p-values for every taxon. The proposed approach reduces the running time of the algorithm by a factor of almost four.
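The idea behind the search can be sketched as follows. The sketch assumes that the taxa are already ordered by importance and that, along this ordering, all taxa before some cutoff are insignificant and all taxa from the cutoff onward are significant. The function `find_cutoff`, the callback `p_value_at`, and the `split` parameter are hypothetical names introduced for illustration; the splitting rule of the actual implementation is not specified in this section.

```python
def find_cutoff(p_value_at, n_taxa, alpha=0.1, split=0.5):
    """Find the first significant taxon while evaluating few p-values.

    p_value_at: function mapping a taxon index to its p-value; in practice
                each call would require building a permutation distribution.
    n_taxa:     number of taxa, ordered from least to most important.
    alpha:      significance level.
    split:      fraction in (0, 1) of the current interval at which to probe;
                0.5 gives a balanced search, other values an unbalanced one.

    Assumes p_value_at(j) < alpha is False below the cutoff and True from the
    cutoff onward. Returns (cutoff, evaluated), where cutoff equals n_taxa if
    no taxon is significant and evaluated maps probed indices to p-values.
    """
    lo, hi = 0, n_taxa  # the cutoff always lies in the interval [lo, hi]
    evaluated = {}
    while lo < hi:
        probe = lo + int((hi - lo) * split)
        probe = min(max(probe, lo), hi - 1)  # keep the probe inside [lo, hi)
        if probe not in evaluated:
            evaluated[probe] = p_value_at(probe)
        if evaluated[probe] < alpha:
            hi = probe       # cutoff is at the probe or to its left
        else:
            lo = probe + 1   # cutoff is strictly to the right of the probe
    return lo, evaluated


# Made-up p-values for 20 ordered taxa: the first 14 are insignificant.
toy_p_values = [0.5] * 14 + [0.01] * 6

cutoff, evaluated = find_cutoff(lambda j: toy_p_values[j], len(toy_p_values))
taxa_to_remove = list(range(cutoff))  # indices 0..13 in this toy example
```

In this toy example only four p-values are evaluated instead of twenty. Since each evaluation stands in for a full permutation distribution, reducing the number of evaluations is the source of the saving in running time.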
The effects of filtering are further evaluated on two major exploratory analyses used in microbiome research: alpha and beta diversity. The methods were applied to two data sets, namely the MicroBiome Quality Control (MBQC) project data from [Sinha et al., 2015] and the laboratory contamination data set (Salter) from [Salter et al., 2014]. The results show that the filtering methods reduce the magnitude of differences in alpha diversity between samples containing the same bacteria that were processed at different MBQC project labs. Filtering further reduces the dissimilarity (beta diversity) between samples that contain the same microbiome and potentially alleviates technical variability. The next section introduces the setup of the MBQC data, which serve as a guided example throughout the methods section.
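As a point of reference for the two exploratory analyses mentioned above, the sketch below computes one common alpha diversity measure (the Shannon index) and one common beta diversity measure (the Bray-Curtis dissimilarity). The choice of these two measures is an assumption made for illustration, since this section does not state which indices are used, and the counts are made up rather than taken from the MBQC or Salter data.

```python
import math


def shannon_index(counts):
    """Shannon diversity of one sample, computed from its taxa counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    # Taxa with zero counts contribute nothing to the sum.
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two samples over the same taxa."""
    total = sum(a) + sum(b)
    if total == 0:
        return 0.0
    return sum(abs(x - y) for x, y in zip(a, b)) / total


# Two made-up samples of the same community, each with one spurious rare
# taxon (the last two columns) that the other sample lacks.
lab_1 = [50, 30, 20, 2, 0]
lab_2 = [48, 33, 19, 0, 3]

# The same samples after removing the two rare taxa.
lab_1_filtered = lab_1[:3]
lab_2_filtered = lab_2[:3]

bc_before = bray_curtis(lab_1, lab_2)
bc_after = bray_curtis(lab_1_filtered, lab_2_filtered)
alpha_lab_1 = shannon_index(lab_1)
```

In this toy example the Bray-Curtis dissimilarity between the two samples is smaller after the rare taxa are removed, which is the kind of effect examined in the analyses that follow.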