New computational methods

A large component of this thesis has been the development of new computational methods required for the analysis of genetic sequences. These methods range from low-level utilities to new methods implemented in large software projects and published in upper-tier bioinformatics journals. Several of these tools are improved analysis methods for very high-throughput genotyping-by-sequencing (GBS) experiments. Axe (Chapter 2) enables the demultiplexing of many hundreds of samples from one sequencing run. Given the increased output of sequencing runs, multiplexing large numbers of samples is required in order to realise the increased cost effectiveness of newer sequencers, and no existing demultiplexer was able to efficiently demultiplex the index sequences required by the GBS protocol. The efficient and reusable implementations of low-level sequence quality control measures in libqcpp and trimit (Chapter 3) provide a “one-stop-shop” for both GBS and other data types, reducing the intellectual burden of using a litany of tools for several similar steps of an analysis (particularly for those with less computational experience).
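To make the demultiplexing task concrete, the following is a minimal sketch, assuming that each read begins with its sample's index sequence and that index sequences may differ in length. It is not Axe's implementation: the function name `demultiplex`, the exact-match rule, and the toy barcodes are assumptions made for illustration only.

```python
from collections import defaultdict


def demultiplex(reads, barcodes):
    """Bin reads by sample using exact prefix matching (illustrative only).

    reads:    iterable of read sequences (strings)
    barcodes: dict mapping index sequence -> sample name (hypothetical input)
    Returns a dict mapping sample name -> list of reads with the index removed.
    Reads that match no index are collected under the key None.
    """
    # Try longer indexes first, so a short index that is a prefix of a
    # longer one cannot claim reads that belong to the longer index.
    by_length = sorted(barcodes, key=len, reverse=True)
    bins = defaultdict(list)
    for read in reads:
        for index in by_length:
            if read.startswith(index):
                bins[barcodes[index]].append(read[len(index):])
                break
        else:
            # No index matched this read.
            bins[None].append(read)
    return dict(bins)


if __name__ == "__main__":
    toy_barcodes = {"ACGT": "sample1", "ACGTA": "sample2", "TTG": "sample3"}
    toy_reads = ["ACGTAGGG", "ACGTCCCC", "TTGAAAA", "GGGGGG"]
    for sample, seqs in demultiplex(toy_reads, toy_barcodes).items():
        print(sample, seqs)
```

This linear scan over all index sequences is adequate for a toy example but scales poorly to many hundreds of samples, and it does not tolerate sequencing errors in the index.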

With kWIP (Chapter 4), I enable the estimation of genetic distance directly from raw sequencing data. This tool allows researchers to perform initial analyses, such as confirming sample identity or family structure, without a reference genome. kWIP's direct estimation of distance from sequence read data enables researchers to use downstream analyses that require only a genetic distance matrix (e.g., Generalised Dissimilarity Modelling of isolation by distance and environment) in situations where alignment to a reference genome is infeasible. However, genetic distance is only one of several metrics of interest in most studies. Using the same underlying data structures and approach employed in kWIP, one can estimate inter-sample covariance; in fact, an approximation of sample covariance is an intermediate in kWIP. Many recent methods in population, landscape, and quantitative genomics define various processes as a function of sample covariance (e.g. population structure and IBD in conStruct, Bradburd et al., 2017; geogenetic space in SpaceMix, 2016; genomic selection using gBLUP, VanRaden, 2008). A kWIP-like method for estimating sample covariance would be especially powerful if it incorporated a PCAngsd-like optimisation of the covariance matrix (Meisner and Albrechtsen, 2018) to minimise the effect of sequencing noise in lower-coverage datasets. Incorporating the estimation of sample covariance in a kWIP-like method would also assist some quantitative genomic methods, for example by extending gBLUP-based genomic selection (VanRaden, 2008). These genetic distance type association experiments can be extended to include microbiome information without the difficulty of metagenome assembly, and are promising for the prediction of traits in field samples.
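The relationship between a pairwise similarity (kernel) matrix and a distance matrix can be illustrated with a minimal alignment-free sketch. It is not kWIP's implementation: it uses exact k-mer counts and an unweighted inner product, and all function names and the toy reads are assumptions made for illustration. The kernel matrix here is a similarity intermediate, loosely analogous to the covariance-like intermediate mentioned above.

```python
from collections import Counter
from math import sqrt


def count_kmers(reads, k):
    """Count every k-mer (substring of length k) across a sample's reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def inner_product(a, b):
    """Unweighted inner product of two k-mer count vectors."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the smaller table
    # Counter returns 0 for missing k-mers without inserting them.
    return sum(count * b[kmer] for kmer, count in a.items())


def kernel_matrix(samples, k):
    """Pairwise inner products between samples (the similarity intermediate)."""
    counts = [count_kmers(reads, k) for reads in samples]
    n = len(counts)
    return [[inner_product(counts[i], counts[j]) for j in range(n)]
            for i in range(n)]


def kernel_to_distance(kernel):
    """Normalise the kernel to a unit diagonal, then convert to distances.

    With a normalised kernel K, d_ij = sqrt(K_ii + K_jj - 2 K_ij)
    = sqrt(2 - 2 K_ij). Samples with no k-mers are given similarity 0.
    """
    n = len(kernel)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            norm = sqrt(kernel[i][i] * kernel[j][j])
            similarity = kernel[i][j] / norm if norm else 0.0
            dist[i][j] = sqrt(max(0.0, 2.0 - 2.0 * similarity))
    return dist


if __name__ == "__main__":
    toy_samples = [["ACGTACGT"], ["ACGTACGT"], ["TTTTTTTT"]]
    for row in kernel_to_distance(kernel_matrix(toy_samples, k=3)):
        print([round(d, 3) for d in row])
```

In this sketch, identical samples have distance 0 and samples sharing no k-mers have distance sqrt(2). A downstream method that needs only a distance matrix could consume the output directly, while one that is defined on sample covariance would instead start from the intermediate matrix.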

The three methods I discuss in detail in this thesis are far from the only methods I have developed for the analyses performed in this thesis. Currently unpublished tools include short-read sequencing utilities (e.g. seqhax; https://github.com/kdmurray91/seqhax), tools for retrieving data from the NCBI Sequence Read Archive (srapy; https://github.com/kdmurray91/srapy), and C++ and Python libraries for exact and probabilistic k-mer counting (pymer and kmerkmer; https://github.com/kmerkmer/). While in many cases these are far from the only tools to implement such functionality, they all advance previous implementations in some way (e.g. efficiency, ease of use, application specificity). These open source software tools continue to be developed, and have been used in numerous projects, both within and beyond my immediate research environment. Additionally, I have contributed large quantities of code to external projects, which in some cases resulted in co-authorship on software publications (see Appendix A).

The aim of scientists developing new methods and software has predominantly been the creation of tools that provide accurate results, as this is overwhelmingly the concern of users (and rightly so). However, given the limited funding and scarcity of academic status given to those developing new software, these new methods often have less than ideal software quality and user friendliness, especially for methods developed outside one of only a few very well resourced bioinformatic method development teams (e.g. Broad Institute, Wellcome Trust, Sanger Centre). While such tools are obviously preferable to well-designed, user-friendly, but methodologically inferior software, we need to acknowledge that the state of academic research software often imposes a significant intellectual burden on users. As an example, one piece of software crucial to the analyses performed in Chapter 6 had a bug which caused incorrect results to be produced from our dataset. To find and fix this bug, I required the ability to debug running C++ programs, and then deep knowledge of both C++ and the metric of interest. Without my background in software engineering this fix would not have been possible, and I would either have detected this error and terminated those analyses or, worse, carried these incorrect outputs on to further analyses. I see it as the responsibility of tool developers to ensure their users need not be experts in software engineering to run their software successfully, a low bar that many bioinformatic methods fail to meet. Only through increased respect and funding for method development can we improve this situation.

In both Chapters 4 and 5, I demonstrate that the analysis of genomic data from leading-edge molecular methods can require the development of new software, often initially specific to some larger genomic project. This implies that direct collaboration between field biologists and computational biologists is just as important as the collaborations that typically exist between field biologists and the molecular biologists whose expertise generates these datasets. Going further, the training of field biologists in bioinformatics is no less important than their training in molecular biology. We must prevent the biologists of the future from being insufficiently confident or insufficiently skilled computationally to analyse the large datasets that they will be able to generate with minimal expense. Good experimental design depends on knowledge across field, molecular, and computational biology, and when these skills are jointly brought to bear, exciting and now published results, methods, and findings have appeared.
