CHAPTER 1: THE HUMAN MICROBIOME AND ANALYSIS
1.2 Review of Strategies for
1.2.2 Sequencing Strategies and
1.2.3.2 Statistical Analysis of Compositional Data
Once amplicon or WGS reads have been transformed into OTU tables, statistical analysis can begin. The complex nature of microbiome datasets makes them challenging to analyze appropriately, and improved methods are constantly being developed. Various unsupervised, exploratory methods have been used, such as clustering and resampling methods, as well as univariate and non-parametric models [105]. Multivariate statistics have been developed and applied to microbiome datasets but may not be appropriate for the data, as they tend to assume linearity when relationships in microbiome data are generally nonlinear [105,106]. Other statistical challenges include the compositional nature of the data and its sparseness [106,107]. Due to variation and error in PCR and sequencing, it is not possible to obtain absolute abundances of bacterial taxa from microbiome sequencing data. However, relative abundances can be calculated, in which the percent compositions of all taxa total 100% for each individual sample. This compositional nature means that a change in the abundance of one taxon will drive apparent changes in the others, since the data is forced to sum to a constant [107]. Rarefaction, which randomly subsamples every library to the size of the smallest library, has been used to normalize this compositional data, but it has been argued that rarefaction is inappropriate for this purpose [108]. McMurdie et al. demonstrated that rarefaction results in high false positive rates when identifying significant differences in species abundance and that it eliminates sequences that can be appropriately clustered using other methods. The continued use and prevalence of rarefaction in the microbiome field highlights how important it is for biologists to understand the theory behind the statistical and computational methods used to analyze microbiome data. Inappropriate application of statistical models can lead to conclusions that the underlying biology does not support.
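The closure effect and rarefaction described above can be illustrated with a minimal sketch. The counts are hypothetical, and the function names `relative_abundance` and `rarefy` are illustrative rather than taken from any package; the subsampling shown is a simple draw without replacement, not the implementation used by QIIME, mothur, or phyloseq.

```python
import random

def relative_abundance(counts):
    """Convert raw counts for one sample into percentages that sum to 100."""
    total = sum(counts)
    return [100.0 * c / total for c in counts]

def rarefy(counts, depth, seed=0):
    """Randomly subsample `depth` reads without replacement from one sample."""
    rng = random.Random(seed)
    # Expand counts into one taxon label per read, e.g. [2, 1] -> [0, 0, 1].
    reads = [taxon for taxon, c in enumerate(counts) for _ in range(c)]
    subsample = rng.sample(reads, depth)
    return [subsample.count(taxon) for taxon in range(len(counts))]

# Hypothetical absolute counts for three taxa in one sample.
before = [100, 100, 100]
# Only taxon 0 grows; taxa 1 and 2 keep the same absolute counts.
after = [400, 100, 100]

# Taxa 1 and 2 appear to decrease in relative terms although they did not change.
print(relative_abundance(before))
print(relative_abundance(after))

# Rarefying to 300 reads discards half of the observed reads in this sample.
rarefied = rarefy(after, depth=300)
print(rarefied, sum(rarefied))
```

In this example the percentages of taxa 1 and 2 fall only because taxon 0 increased, which is the dependence among taxa that closure to a constant sum introduces.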
Development of appropriate statistical analysis methods for microbiome data is challenging. Corrections for the compositional nature of the data include log-ratio transformations which, in theory, do not alter the underlying covariance or correlation structure of the data and allow application of traditional statistical analyses [107]. However, the sparseness of microbiome data often makes this transformation problematic, as it requires dividing by the geometric mean of the taxa. If any taxon has a count of zero, the geometric mean is zero and the value becomes undefined. Pseudo-counts have been used to correct for this, in which the same small, arbitrary number is added to all counts so that none are zero. This poses problems too, as division by 1 is the same as analyzing unnormalized data and the consequences of using other values are not well understood, particularly in light of the important role low-abundance taxa may play in microbial communities. Despite these issues, transformation of compositional microbiome data enables the use of traditional statistical methods to determine significant changes in microbial populations.
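As a minimal sketch of the log-ratio approach discussed above, the centered log-ratio (CLR) transformation divides each count by the geometric mean of the sample and takes the logarithm. The sample counts and the pseudo-count values below are assumptions chosen for illustration, and the function name `clr` is hypothetical.

```python
import math

def clr(counts, pseudocount=0.5):
    """Centered log-ratio transform of one sample after adding a pseudo-count."""
    shifted = [c + pseudocount for c in counts]
    logs = [math.log(x) for x in shifted]
    mean_log = sum(logs) / len(logs)  # logarithm of the geometric mean
    # log(x / geometric_mean) == log(x) - mean(log(x))
    return [v - mean_log for v in logs]

# Hypothetical counts for four taxa; the third taxon was not observed.
sample = [120, 30, 0, 5]

# The transformed values depend on the pseudo-count that was chosen.
for pc in (0.5, 1.0):
    print(pc, [round(v, 3) for v in clr(sample, pc)])
```

Without a pseudo-count the zero count has no logarithm and the transform fails, and the two pseudo-counts give different values for every taxon, which is why the choice of pseudo-count matters.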
Microbiome statistical analysis methods draw heavily on diversity methods from the field of ecology. Species diversity indices are widely used to simplify complex microbial communities by assigning values that represent overall trends in the population [109,110]. These indices have been used to compare changes in microbiota diversity according to relevant community variables, such as environment and patient disease state. They fall into one of two categories: alpha diversity, which quantifies within-sample taxa diversity, and beta diversity, which quantifies between-sample diversity [106]. Several methods exist to calculate alpha diversity, including the widely used Shannon and Simpson indices [110]. Both of these indices combine measures of taxa richness (the number of different taxa) and abundance but do so with different underlying theoretical foundations. The Shannon index is more abstract and represents the uncertainty in predicting the taxon to which a randomly chosen individual belongs, while the Simpson index is more intuitive and indicates the probability that two randomly chosen individuals belong to different taxa [110]. Species evenness, or how evenly individuals are distributed among taxa, can be derived from both of these indices. The Chao1 index is used less frequently but is a non-parametric method that can estimate OTU richness and performs well with low-abundance communities [109,111]. Several R packages can be used to calculate alpha diversity, including vegan [112] and phyloseq [113]. QIIME and mothur contain scripts for alpha diversity as well, as does the open-source software Explicet [114]. Several studies have compared the usefulness of these indices when applied to metagenomic datasets and generally agree that all three are appropriate, despite their varying foundational theories, and suggest that studies may benefit from using and comparing all of them to determine interactions within communities [109–111].
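The alpha diversity indices described above can be sketched in a few lines. This is an illustrative implementation under stated assumptions, not the code used by vegan, phyloseq, QIIME, or mothur: the Simpson index is given in its 1 - sum(p^2) form, evenness is computed as Pielou's J (Shannon divided by the log of observed richness), and Chao1 uses the classic estimator with the usual fallback when no doubletons are observed. The counts are hypothetical.

```python
import math

def proportions(counts):
    """Proportions of the observed (non-zero) taxa in one sample."""
    total = sum(counts)
    return [c / total for c in counts if c > 0]

def shannon(counts):
    """Shannon index H = -sum(p * ln p)."""
    return -sum(p * math.log(p) for p in proportions(counts))

def simpson(counts):
    """Simpson index as 1 - sum(p^2): chance two random reads differ in taxon."""
    return 1.0 - sum(p * p for p in proportions(counts))

def pielou_evenness(counts):
    """Evenness J = H / ln(S), where S is the number of observed taxa."""
    richness = sum(1 for c in counts if c > 0)
    return shannon(counts) / math.log(richness)

def chao1(counts):
    """Chao1 richness estimate from singleton (f1) and doubleton (f2) counts."""
    observed = sum(1 for c in counts if c > 0)
    f1 = sum(1 for c in counts if c == 1)
    f2 = sum(1 for c in counts if c == 2)
    if f2 > 0:
        return observed + f1 * f1 / (2 * f2)
    return observed + f1 * (f1 - 1) / 2  # fallback when no doubletons exist

# Hypothetical OTU counts for one sample (one OTU unobserved).
sample = [50, 30, 10, 5, 2, 2, 1, 1, 1, 0]
even = [25, 25, 25, 25]

for name, fn in [("Shannon", shannon), ("Simpson", simpson),
                 ("Evenness", pielou_evenness), ("Chao1", chao1)]:
    print(name, round(fn(sample), 3), round(fn(even), 3))
```

A perfectly even community reaches the maximum Shannon value for its richness and an evenness of 1, while the skewed sample scores lower on both, and its singletons raise the Chao1 estimate above the observed richness.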
Beta diversity measures allow comparison of similarity and dissimilarity among microbial communities. They are particularly important in identifying trends over time within large datasets [115]. Commonly used beta diversity measures include Morisita-Horn similarity and Bray-Curtis dissimilarity [114]. Both of these beta indices and the previously described alpha indices are based on normalized counts of taxa and do not take phylogenetic relationships into account. Phylogeny indicates the evolutionary history of organisms, and trees can be built to represent these relationships [106,116]. Fast UniFrac is a popular beta diversity method that calculates the distance between microbial communities based on the branch lengths of phylogenetic trees [116]. The fact that it is a distance metric allows analysis of the resulting data with standard multivariate methods, such as clustering and principal coordinates analysis (PCoA). Fast UniFrac was developed by the Knight lab and is included in both the QIIME and mothur pipelines. Visualization of Fast UniFrac data with PCoA plots allows easy identification of community similarity by clustering. Fast UniFrac has been cited in over 200 papers and has been used to compare similarity among environmental and host-associated microbial communities. Several recent papers have used it to compare bacterial communities in sludge systems [117], in subtropical rainforests [118], and in patients with recurrent aphthous stomatitis, an oral mucosal disorder [119]. Each of these studies also employs a range of alpha diversity indices to compare microbiota. Beta diversity measures are useful in understanding overall trends and changes between samples in a study.
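The two non-phylogenetic beta diversity measures named above can be sketched as follows. The formulas are the standard definitions of Bray-Curtis dissimilarity and Morisita-Horn similarity; the function names and the counts are hypothetical, and UniFrac is not shown because it additionally requires a phylogenetic tree.

```python
def bray_curtis(x, y):
    """Bray-Curtis dissimilarity: sum|x_i - y_i| / sum(x_i + y_i)."""
    numerator = sum(abs(a - b) for a, b in zip(x, y))
    denominator = sum(a + b for a, b in zip(x, y))
    return numerator / denominator

def morisita_horn(x, y):
    """Morisita-Horn similarity between two count vectors."""
    total_x, total_y = sum(x), sum(y)
    dominance_x = sum(a * a for a in x) / total_x ** 2
    dominance_y = sum(b * b for b in y) / total_y ** 2
    cross = sum(a * b for a, b in zip(x, y))
    return 2 * cross / ((dominance_x + dominance_y) * total_x * total_y)

# Hypothetical samples: b has the same composition as a at twice the depth,
# and c shares no taxa with a.
a = [10, 20, 30, 0]
b = [20, 40, 60, 0]
c = [0, 0, 0, 50]

print(bray_curtis(a, b), morisita_horn(a, b))
print(bray_curtis(a, c), morisita_horn(a, c))
```

In this example Bray-Curtis on raw counts reports a non-zero dissimilarity between two samples that differ only in sequencing depth, whereas Morisita-Horn treats them as identical, which illustrates why these indices are applied to normalized counts.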
Though diversity indices are useful in assessing overall trends among microbiota, identifying differential abundance of individual taxa among groups may indicate specific bacteria that play important roles in the environment or disease state. Besides the compositional nature of microbiome data, both its sparseness and its tendency to be dominated by a few taxa make appropriately modeling this data difficult [106]. As mentioned previously, log-ratio transformations can be used in order to apply standard downstream statistical analyses. Dirichlet multinomial mixtures have been developed that take into account data sparsity as well as the presence of diverse and rare taxa [120]. Two-sample t-tests have been employed to determine differential abundance among abundant taxa, and Fisher’s exact test has been used for rare taxa [106]. Variations on the Wilcoxon rank-sum test have been used as well [114]. Specific tools have been developed to manage the challenges of microbiome data and identify significantly enriched taxa. Curtis Huttenhower’s group at Harvard has developed a suite of analysis tools written in a combination of Python, R, and Perl that perform both compositional and statistical data analysis. These tools have been implemented in the Galaxy platform, which is a web-based environment that allows researchers without a programming background to analyze high-throughput data [121]. The Huttenhower group’s programs LEfSe and MaAsLin can be used within Galaxy to determine significant enrichment of bacterial taxa based on relevant biological information [122]. LEfSe employs a combination of the Kruskal-Wallis rank-sum test, the Wilcoxon rank-sum test, and linear discriminant analysis in order to rank significant enrichment of bacteria between two biological classes, such as diseased and healthy. Wu et al. recently used LEfSe to detect bacteria significantly enriched among gut microbiota of normal control mice and those exposed to lead [123]. MaAsLin takes this a step further and allows detection of enriched taxa among multiple biological classes. The previously mentioned pipelines and R packages also contain methods to detect differentially abundant taxa and visualization options to compare them. Given the difficulty of choosing appropriate statistical methods to detect enriched taxa, experimental methods such as quantitative PCR (qPCR) should be used to confirm bacterial abundances.
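As a minimal sketch of the rank-based testing mentioned above, the Wilcoxon rank-sum (Mann-Whitney U) test can be written with average ranks for ties and a normal approximation for the p-value. This is not the LEfSe or MaAsLin implementation; the function names and relative abundances are hypothetical, and the approximation omits tie and continuity corrections, so it is only a rough guide for small samples.

```python
import math

def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j are zero-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def rank_sum_test(group_a, group_b):
    """Return (U statistic of group_a, two-sided p-value by normal approximation)."""
    n_a, n_b = len(group_a), len(group_b)
    ranks = average_ranks(list(group_a) + list(group_b))
    u_a = sum(ranks[:n_a]) - n_a * (n_a + 1) / 2
    mean_u = n_a * n_b / 2
    sd_u = math.sqrt(n_a * n_b * (n_a + n_b + 1) / 12)  # no tie correction
    z = (u_a - mean_u) / sd_u
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return u_a, p_value

# Hypothetical relative abundances of one taxon in two groups of samples.
healthy = [0.02, 0.03, 0.01, 0.04, 0.02]
diseased = [0.10, 0.12, 0.08, 0.15, 0.09]

u_stat, p_value = rank_sum_test(healthy, diseased)
print(u_stat, p_value)
```

Because the test uses only ranks, it makes no assumption about the distribution of the abundances. In practice each taxon is tested separately, so a multiple-testing correction would also be needed.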
While diversity indices provide overall trends among microbiome data and differential abundance detects changes in specific taxa, co-occurrence relationships and network analyses aim to understand how the microbes in a community interact with each other or respond to specific variables [124]. Rather than describing how and to what degree microbial communities change, network and co-occurrence analyses predict how taxa influence each other or are altered by outside variables through the use of correlation coefficients and networks. These methods are a type of dimensionality reduction, in which complex microbiome data can be mathematically condensed into a simpler version that is easier to interpret and understand. Various studies have used both Pearson and Spearman methods to calculate correlation coefficients for changes in microbial taxa and external factors, such as exercise [125] and bacterial metabolites [126]. Though the Pearson method is appropriate for parametric data and the Spearman method for non-parametric data, neither of these methods takes the compositional nature of the data into account [127]. Sparse Correlation for Compositional Data, or SparCC, was developed to determine pairwise correlations between microbial taxa while correcting for the data’s compositional nature [128]. It relies on log-ratio transformation of the data and has been shown to produce fewer false correlations than the Pearson method. Another method, Sparse Inverse Covariance Estimation for Ecological Association Inference (SPIEC-EASI), dispenses with pairwise correlations and instead attempts to infer the entire correlation network simultaneously [129]. It does this through use of a graphical model inference framework that assumes the data is compositional and sparse. Though SPIEC-EASI is more reproducible than SparCC, results from the two methods are not directly comparable due to the different ways in which they calculate microbial correlations. Each of these tools is useful in determining microbe-microbe and microbe-external variable correlations that could indicate their potential in predicting interactions or outcomes.
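The reason standard correlation coefficients can mislead on compositional data can be shown with a small worked example. The counts below are hypothetical absolute abundances for a two-taxon community, and the function names are illustrative. The log-ratio variance computed here is only the kind of quantity that log-ratio based methods build on; it is not the SparCC or SPIEC-EASI algorithm.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

def log_ratio_variance(x, y):
    """Sample variance of log(x_i / y_i) across samples."""
    ratios = [math.log(a / b) for a, b in zip(x, y)]
    mean_ratio = sum(ratios) / len(ratios)
    return sum((r - mean_ratio) ** 2 for r in ratios) / (len(ratios) - 1)

# Hypothetical absolute abundances of two taxa across four samples.
taxon_a = [10, 20, 30, 40]
taxon_b = [5, 20, 25, 50]

# Closure: convert each sample to relative abundances that sum to 1.
totals = [a + b for a, b in zip(taxon_a, taxon_b)]
rel_a = [a / t for a, t in zip(taxon_a, totals)]
rel_b = [b / t for b, t in zip(taxon_b, totals)]

print(pearson(taxon_a, taxon_b))  # strongly positive on absolute counts
print(pearson(rel_a, rel_b))      # forced to -1 by closure with two taxa
print(log_ratio_variance(taxon_a, taxon_b), log_ratio_variance(rel_a, rel_b))
```

The two taxa rise together in absolute terms, yet their relative abundances are perfectly negatively correlated because they must sum to one. The variance of their log-ratio is the same before and after closure, which is why log-ratios are a natural basis for compositional correlation methods.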
Statistical analysis of microbiome data is complicated and requires knowledge of both the biology behind the study and the mathematics driving the models and algorithms employed. Appropriate models are still under active study, making it crucial that biologists collaborate with statisticians and bioinformaticians to choose the model best suited to the data and to generate valid biological conclusions.
CHAPTER 2: BURN AND INHALATION INJURY AND ITS RELATION TO THE