Four hundred sequences of length 400 each where each base is randomly sampled from a uniform (0.2, 0.2, 0.2, 0.2, 0.2) distribution where the fifth option is for a gap.

400 400 1.9603 1.9902 1.9947 1.9929 1.9973 2 0.0061

10. Four sequences of length 10 each where each base is randomly sampled from a uniform (¼, ¼, ¼, ¼) distribution. A fifth sequence consisting only of gaps was also added.

5 10 0.8113 1 1 1.2623 1.5 2 0.4575

11. Four sequences of length 10 each where each base is randomly sampled from a uniform (¼, ¼, ¼, ¼) distribution. An eleventh position in which all sequences had only gaps was also added.

4 11 0.8113 1 1 1.2623 1.5 2 0.4575

12. Simulated low diversity sample. 872 471 0 0 0 0.0006 0 0.0331 0.0029

13. Simulated high diversity sample. 691 471 0 0.0157 0.0619 0.1853 0.2343 1.2008 0.2701

5.3.4 Future Work and Conclusions

The computeEntropyFromAlignedFasta.R script is a basic wrapper for the entropy.empirical function from the entropy package. It adds handling for IUPAC characters (as described in section 5.3.2) and gaps, and performs some basic processing of the results. This script provides the entropy statistics to the pipeline, where they are employed as measures of sequence diversity in the alignments. The tests we performed demonstrate that the entropy computation step works as expected. A key shortcoming of using statistics computed on the entropies calculated at each position is the underlying assumption of independence between the positions. When a small number of sequences that diverge significantly from each other each occur many times in the sample, entropy will be very high. In such cases, entropy immediately following dual infection may be higher than it will be for later samples, when each of the two strains has developed a diverse quasispecies. Including different measures of diversity, as described by Gregori et al. (2016), may ameliorate this problem.

5.4 PhyML

PhyML is phylogeny software designed to compute tree topology based on the maximum-likelihood principle (Guindon et al., 2010). PhyML is incorporated into the pipeline to produce trees that can be manually reviewed by experts as part of an audit of the data processing. Additionally, the average of the pairwise distances computed from the tree informs the classification of the sample as either a single founder or multi-founder sample.

5.4.1 Overview of PhyML

The main goal of PhyML is to estimate maximum likelihood phylogenies from aligned sequences.

Briefly, a set of rules relating the probabilities of the different mutations to each other is specified. An algorithm is then used to search through different parameterizations of the model (tree shapes, branch lengths and mutation rates satisfying the specified rules) for the parameter values under which the data is most likely. Among time-reversible models, the most flexible set of rules places no further constraints on how the probabilities relate to each other; it is called the General Time-Reversible (GTR) model.

PhyML produces two output files: a file with a number of relevant statistics (file name ends in _phyml_stats.txt) and a file with the most likely tree in Newick format (file name ends in _phyml_tree.txt). While it is running, PhyML prints to STDOUT the settings it was called with, the value of the log likelihood function for some trees and the total time taken.

5.4.2 Implementation Details

The pipeline includes a wrapper, called runPhyML.pl, that formats the data for PhyML, calls PhyML and parses the results. Note that a patched version of PhyML is required. Appendix 8.6 contains the instructions for obtaining this patched version of PhyML. The main script, identify_founders.pl, calls the runPhyML.pl script using backtick notation and supplies two command line arguments, which specify the input file and the location for the intermediate files and results. runPhyML.pl first converts the line endings of the input file to Unix style (\n instead of \r\n) and then converts the file to the sequential Phylip format (hereafter referred to as the Phylip format). The Phylip format specifies that the first line of the file must contain two integers separated by a space: the first is the number of sequences in the file and the second is the length of the alignment (all sequences must be of equal length). The rest of the Phylip file contains the sequences, one per line. Each line contains the header and the sequence itself, separated by a tab. A maximum of 4000 sequences can be passed to PhyML. If there are more than 4000 sequences, only the first 4000 will be processed.
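The conversion steps described above can be illustrated with a minimal Python sketch. This is not the runPhyML.pl code (which is written in Perl); the function name `fasta_to_sequential_phylip` and the `max_sequences` parameter are hypothetical names introduced here for illustration.

```python
def fasta_to_sequential_phylip(fasta_text, max_sequences=4000):
    """Convert FASTA text to the sequential Phylip layout described in the text."""
    # Normalise Windows line endings (\r\n) to Unix line endings (\n).
    fasta_text = fasta_text.replace("\r\n", "\n")

    records = []  # (header, sequence) pairs in input order
    header, chunks = None, []
    for line in fasta_text.split("\n"):
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))

    # Only the first max_sequences records are kept.
    records = records[:max_sequences]

    lengths = {len(seq) for _, seq in records}
    if len(lengths) != 1:
        raise ValueError("expected at least one sequence, all of equal length")

    # First line: number of sequences, a space, then the alignment length.
    lines = [f"{len(records)} {lengths.pop()}"]
    # One sequence per line: header, a tab, then the sequence.
    lines.extend(f"{name}\t{seq}" for name, seq in records)
    return "\n".join(lines) + "\n"
```

For example, `fasta_to_sequential_phylip(">s1\r\nACGT\r\n>s2\r\nACGA\r\n")` yields a first line of `2 4` followed by one tab-separated line per sequence.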

The call to PhyML is executed using the system function and several options, described in Table 35, are passed to PhyML as command line arguments. The output produced on STDOUT and STDERR by this call is captured into files whose names are formed by stripping the extension from the input fasta file and appending .phylip_phyml.out and .phylip_phyml.err respectively. If any output was generated on STDERR, execution is halted and that output is printed to screen. If no errors were encountered, the distance matrix is parsed out of the output that was printed to STDOUT. All the pairwise distances are stored in an array called @diversity. Summary statistics are computed on this array and stored in a file whose name is constructed by stripping the extension from the input fasta file and appending ‘.phylip_phyml_pwdiversity.txt’ to it. The summary statistics recorded in the pairwise diversity file include the number of sequences, average, standard error, minimum, median, maximum and the first and third quartiles.
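As a rough illustration of the summary step, the following Python sketch computes the listed statistics from a list of pairwise distances. It is an assumption-laden sketch rather than the runPhyML.pl implementation: the function name, the dictionary keys and the quartile interpolation method (`inclusive`) are all choices made here, and the standard error is taken to be the sample standard deviation divided by the square root of the number of distances.

```python
import math
import statistics

def summarise_pairwise_distances(distances, n_sequences):
    """Summarise pairwise distances (requires at least two distances)."""
    # Quartile method is an assumption; the original script may differ.
    q1, median, q3 = statistics.quantiles(distances, n=4, method="inclusive")
    return {
        "n_sequences": n_sequences,
        "average": statistics.mean(distances),
        # Assumed definition: sample standard deviation / sqrt(count).
        "standard_error": statistics.stdev(distances) / math.sqrt(len(distances)),
        "minimum": min(distances),
        "first_quartile": q1,
        "median": median,
        "third_quartile": q3,
        "maximum": max(distances),
    }
```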

Table 35: The parameters specified in the call to PhyML by the runPhyML.pl script.

| Option flag | Value | Description |
|---|---|---|
| -i | | Name and path of the input file. |
| -d | nt | The sequence data type; nt indicates nucleotides. |
| -q | NA | When the -q flag is specified, PhyML will treat the input data as if it is in sequential Phylip format. |
| -b | 0 | Do not use bootstrapping. |
| -m | GTR | Use the general time reversible substitution model. |
| -v | e | The proportion of invariable sites, where an invariable site is a site that does not evolve. The value ‘e’ means that this proportion should be estimated from the data. |
| -c | 4 | The number of categories of substitution rates. Rates of evolution vary from site to site. The value 4 indicates that the model should divide the rates of evolution into 4 categories and estimate rates for each of these four categories. |
| -o | tlr | Instruct PhyML to compute the optimal tree topology, branch lengths and rate parameters. PhyML can be used to compute the likelihood of a given tree for a given dataset, so setting this parameter to ‘n’ will prevent PhyML from changing the topology, branch lengths or rate parameters and it will instead just compute the likelihood of the tree. |
| -a | e | The parameter (alpha) of the gamma distribution that models the substitution rates should be estimated from the data. |
| -f | m | The frequencies of the bases under equilibrium should be estimated using maximum likelihood. |
| -t | e | The ratio of transitions to transversions should be estimated from the data. |
| -s | NNI | Which process should be used to optimize the tree topology? NNI specifies a hill-climbing algorithm that simultaneously adjusts the topology and branch lengths to maximize the likelihood. |
| --no_memory_check | | Suppresses an interactive warning from PhyML if the analysis will use a large amount of memory. |
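For illustration, the options in Table 35 can be assembled into an argument list as in the Python sketch below. The flags and values are taken from the table; the executable name `phyml`, the function name `build_phyml_command` and the order of the arguments are assumptions, and the pipeline itself issues the call from Perl.

```python
def build_phyml_command(phylip_path, executable="phyml"):
    """Assemble the PhyML argument list using the options from Table 35."""
    return [
        executable,           # assumed name of the patched PhyML binary
        "-i", phylip_path,    # input file
        "-d", "nt",           # nucleotide data
        "-q",                 # sequential Phylip input
        "-b", "0",            # no bootstrapping
        "-m", "GTR",          # general time reversible model
        "-v", "e",            # estimate proportion of invariable sites
        "-c", "4",            # four substitution rate categories
        "-o", "tlr",          # optimise topology, branch lengths, rates
        "-a", "e",            # estimate gamma shape parameter
        "-f", "m",            # ML estimate of base frequencies
        "-t", "e",            # estimate transition/transversion ratio
        "-s", "NNI",          # NNI topology search
        "--no_memory_check",  # suppress interactive memory warning
    ]
```

Such a list could be passed to `subprocess.run(cmd, capture_output=True, text=True)` so that STDOUT and STDERR are captured separately, mirroring the .out and .err files described above.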

5.4.3 Test Procedure

Testing the integration of PhyML into the pipeline focused on two aspects: the presence of all sequences in the trees produced and relative comparisons of the pairwise distance metrics. Since the trees will be reviewed manually and setting up algorithmic tests for tree topology is a complex task, we did not include any tests for tree shape. Ensuring that all sequences are present in the tree is required since manually checking for the presence of hundreds of sequences is not practical. Computing the true solution for the pairwise distances is also a complex task since it is dependent on the tree topology, hence we primarily employed tests that compare these metrics on two datasets when we know that one dataset is more diverse than the other. Some manual inspection of the absolute values of PhyML’s pairwise distance estimates was performed, and the results are discussed where unexpected values were observed.

A further complication in testing phylogenetic calculations is that the simplest datasets (from a simulation point of view) present major obstacles to the algorithms that estimate the tree shape and parameter values. For example, the most basic dataset is one in which all sequences at all positions are exactly the same nucleotide. In this case there is no information for estimating the parameter values associated with the other three nucleotides and a very large set of shapes is equally likely. This presents challenges to any algorithm that needs to converge to some optimal solution.

The testing is divided into three sections. Unless stated otherwise, datasets were simulated to mimic a bifurcating tree in which each offspring sequence has a chance to have a number of mutations away from its parent. Data simulation is described in detail in the next section. The first set of tests evaluates edge cases exploring extreme homogeneity (1.1), extreme heterogeneity (1.2), order of the sequences (1.3), presence of gaps (1.4) and presence of ambiguity characters (1.5). The second set of tests focuses on relative pairwise distances. In the first test in this section, the number of mutations differentiating an offspring sequence from its parent was systematically increased (2.1). Relative pairwise distances were also compared on datasets simulated to mimic single, dual and ‘triple’ infection (2.2). A dataset that does not have a tree-shaped phylogeny was simulated by taking an initial sequence and deriving sequences by randomly mutating random bases in the sequence (2.3). The last tests in the relative pairwise distances section examined MiSEQ-like (2.4) and SGQ-like (2.5) datasets. The final group of tests uses datasets designed to mimic either a real low-diversity sample (3.1) or a real high-diversity sample (3.2). More details about the tests that were performed are listed in Table 40.

Each test involves calling runPhyML.pl to process three closely related datasets: a target dataset, a dataset designed to mimic the target dataset but with less variability and a dataset designed to mimic the target dataset but with more variability. The target dataset is also referred to as the moderately diverse dataset. The type of comparison to perform between the target and the low and high diversity datasets must also be specified. For most cases, the low (high) variability dataset is expected to have a strictly smaller (larger) average pairwise distance than the target dataset. However, in some cases, such as testing the ordering of the sequences, the goal is to test for equality of average pairwise distances, since having the result of the phylogenetic analysis depend upon the input order of the sequences is sub-optimal.

Six checks are performed when a test is run. First, the runPhyML.pl script should execute successfully on all three datasets. All the sequences from the original dataset should be present in the input file prepared for PhyML and all the sequences present in the input file prepared for PhyML should be in the original dataset. Since the output from PhyML does not include the sequences themselves, the only simple way to check that PhyML was run on the correct sequences is to ensure that the number of leaf nodes in the tree matches the number of sequences in the original dataset. Lastly, the average pairwise distance from the low (high) diversity dataset should be smaller (larger) than, or in special cases equal to, the average pairwise distance from the target dataset. The summary metrics of the pairwise distances were also recorded and reported. The results of all tests are presented in Table 41 and are further discussed in section 5.4.5.
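Two of these checks, the leaf count and the relative comparison of average pairwise distances, can be sketched in Python as follows. This is an illustrative sketch only: the function names and the `comparison` argument are hypothetical, the Newick parsing assumes unquoted leaf labels, and equality is tested with a floating point tolerance, which is an assumption about how the special case would be handled.

```python
import math
import re

def newick_leaf_names(newick):
    """Return the leaf labels of a Newick string (unquoted labels only)."""
    # A leaf label directly follows "(" or ","; labels that follow ")"
    # belong to internal nodes (e.g. support values) and are skipped.
    return re.findall(r"[(,]\s*([^(),:;\s]+)", newick)

def leaf_count_matches(newick, sequence_names):
    """Check that the tree has as many leaves as the original dataset."""
    return len(newick_leaf_names(newick)) == len(sequence_names)

def diversity_order_holds(low, target, high, comparison="strict"):
    """Compare average pairwise distances of the three related datasets."""
    if comparison == "equal":
        # Special case, e.g. when only the sequence order was changed.
        return math.isclose(low, target) and math.isclose(target, high)
    return low < target < high
```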

5.4.4 Data simulation

Four distinct types of datasets were simulated for the testing of PhyML and Poisson Fitter. Poisson Fitter is an algorithm that uses sequence data for the viral quasispecies to compute how much time has elapsed between the initial infection of a patient with HIV and the time at which a sample was taken. Poisson Fitter is discussed in detail in section 5.5. The first approach generated sequences by simulating a tree and producing a dataset from the leaf nodes. This approach should generate datasets that are easy for phylogenetic algorithms to model accurately. The second approach generates data that is not based on a tree structure and is equivalent to selecting an ancestral sequence and then generating sequences by randomly introducing mutations into the ancestral sequence. This so-called “star-like phylogeny model” has been proposed as sensible for acute HIV infection and is the model assumed by the Poisson Fitter algorithm (Giorgi et al., 2010). Unless otherwise stated, all simulations were initiated with a sequence that is just ACGT repeated until a sequence of adequate length was obtained.

Next, the preparation of a custom dataset, referred to as the balanced dataset, is described. This dataset is designed to allow modification while preserving the nucleotide and mutation ratios. The final portion of this section describes the simulation of datasets based on real world samples.

5.4.4.1 Simulation of tree-like datasets

A simple recursive algorithm was used to generate the tree-like datasets. Given a single sequence, two sequences are derived from it by introducing between zero and 𝑥 mutations into the initial sequence. The locations of the mutations are drawn uniformly without replacement, and the replacement nucleotide is sampled uniformly from the other three nucleotides when a mutation occurs. No gap characters are introduced by this process and only non-gap characters are eligible for mutation. The parameter 𝑥 can be changed to control the amount of diversity in the dataset. Each of the two derived sequences is then recursively treated as the single given sequence and two sequences are derived from each of them by introducing between zero and 𝑥 mutations per new sequence. Thus, after two steps, four sequences have been generated. This process is repeated n times, until the desired number of leaf sequences is obtained. This simulation approach is referred to in the rest of this section as the recursive algorithm.
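A minimal Python sketch of the recursive algorithm is given below. It is not the code used to produce the test datasets. It assumes that the number of mutations per derived sequence is drawn uniformly from the integers zero to x inclusive and that sequences contain only A, C, G, T and the gap character `-`; the function names are hypothetical.

```python
import random

NUCLEOTIDES = "ACGT"

def mutate(sequence, max_mutations, rng):
    """Derive one offspring with between zero and max_mutations mutations."""
    # Assumption: the mutation count is uniform on 0..max_mutations.
    n_mutations = rng.randint(0, max_mutations)
    # Only non-gap positions are eligible; drawn without replacement.
    eligible = [i for i, base in enumerate(sequence) if base != "-"]
    positions = rng.sample(eligible, min(n_mutations, len(eligible)))
    bases = list(sequence)
    for pos in positions:
        # Replace with one of the other three nucleotides, chosen uniformly.
        bases[pos] = rng.choice([b for b in NUCLEOTIDES if b != bases[pos]])
    return "".join(bases)

def simulate_tree_like(ancestor, depth, max_mutations, rng):
    """Return the 2**depth leaf sequences of a simulated bifurcating tree."""
    if depth == 0:
        return [ancestor]
    leaves = []
    for _ in range(2):  # each sequence gives rise to two offspring
        child = mutate(ancestor, max_mutations, rng)
        leaves.extend(simulate_tree_like(child, depth - 1, max_mutations, rng))
    return leaves
```

Calling `simulate_tree_like("ACGT" * 100, depth=7, max_mutations=3, rng=random.Random(1))` would return 128 sequences of length 400; passing a seeded `random.Random` instance keeps a simulated dataset reproducible.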

Figure 69: The average pairwise Hamming distance between two sequences in a dataset simulated using the tree-based algorithm based on the depth of the tree and the maximum number of mutations per generation. The black lines show the predicted pairwise distance under the assumption that any position can only mutate once in the simulation process.

Sequences are of length 400, so that 1 mutation will result in a 0.25% difference between two sequences. A tree depth of 2 results in a dataset with 2²= 4 sequences while a tree depth of 7 yields a dataset with 2⁷= 128 sequences.

A dataset produced with this recursive simulation approach contains all the n’th generation offspring. Each offspring sequence differs from the ancestral sequence by $nx/2$ mutations on average, since each of the $n$ generations introduces $x/2$ mutations on average, under the assumption that the probability of two mutations affecting the same nucleotide is negligible. Additionally, the relatedness between the leaf nodes implies that each sequence has one ($2^0$) sequence that differs from it by $1 \cdot x$ mutations on average, two ($2^1$) sequences that differ from it by $2 \cdot x$ mutations, four ($2^2$) sequences that differ from it by $3 \cdot x$ mutations and so forth. Thus, two sequences in the dataset differ from each other by

$$\frac{\sum_{i=0}^{n-1}\left[2^{i}(i+1)x\right]}{2^{n}-1} \qquad \text{(Equation 1)}$$

mutations on average, assuming that no position in the sequence can mutate more than once. Figure 69 shows the measured (colored lines) and predicted (black lines) pairwise distances in datasets simulated using the recursive algorithm. At higher mutation rates and for larger generation numbers, the actual average pairwise sequence distances diverge from the distances predicted by Equation 1, because mutations increasingly occur at the same positions and start to overwrite each other.
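Equation 1 can be evaluated directly, and cross-checked by enumerating all pairs of leaves in a complete binary tree. The Python sketch below does both. The function names are hypothetical, and the enumeration relies on the same assumptions as the text: each of the edges between a leaf and an ancestor contributes x/2 mutations on average, and no position mutates twice.

```python
def predicted_mean_pairwise_distance(depth, max_mutations):
    """Evaluate Equation 1 for tree depth n >= 1 and mutation parameter x."""
    total = sum(2**i * (i + 1) * max_mutations for i in range(depth))
    return total / (2**depth - 1)

def enumerated_mean_pairwise_distance(depth, max_mutations):
    """Average the expected distance over every pair of leaves."""
    n_leaves = 2**depth
    total, pairs = 0, 0
    for a in range(n_leaves):
        for b in range(a + 1, n_leaves):
            # Number of generations back to the most recent common ancestor
            # when leaves are numbered left to right.
            generations = (a ^ b).bit_length()
            # Two paths of `generations` edges, x/2 mutations per edge.
            total += generations * max_mutations
            pairs += 1
    return total / pairs
```

Dividing the result by the sequence length (400 in the simulations described here) converts the expected number of differing positions into a proportion.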

5.4.4.2 Simulation of star-like datasets

To generate datasets that do not have a tree-like structure, a single frequency matrix containing the target prevalence of each nucleotide at each position is constructed using a simulation approach. In a second step, this frequency matrix is used to generate the sequences of the dataset. The frequency matrix has four rows, each representing a nucleotide. The number of columns in the frequency matrix corresponds to the length of the sequences that will be simulated. A sequence is simulated from such a frequency matrix by drawing a nucleotide for each position with the probabilities given by the frequency matrix. A dataset is simulated by simulating many sequences from the same frequency matrix.

To construct a frequency matrix, a parameter called the dominance parameter ($dom$) and a target sequence are required. Each position is treated completely independently of the other positions. The dominant nucleotide (the most frequently occurring nucleotide) for the position is read from the target sequence. The prevalence of the dominant nucleotide at each position is sampled uniformly from the interval between the value of the dominance parameter and one. This produces a vector of frequencies, $dom_i$, containing the prevalence of the most prevalent nucleotide at each position $i$. Note that each sampled $dom_i$ is at least as large as the $dom$ parameter. The prevalences of the remaining three nucleotides at a position $i$ are determined by uniformly drawing an integer, $z_i$, between zero and two, which specifies how many of the remaining three prevalences will be set to zero. From the remaining three nucleotides (the non-dominant nucleotides at the position), $z_i$ nucleotides are randomly drawn and their prevalences are set to zero. The other non-dominant nucleotide(s) are then each assigned a prevalence equal to $\frac{1 - dom_i}{3 - z_i}$, where $dom_i$ is the prevalence of the most prevalent nucleotide for position $i$. Hence, at each position, a single nucleotide will occur the majority of the time, some (possibly none) of the nucleotides will never occur and the remaining mass will be equally distributed between the remaining nucleotides. This process is repeated for each position in the dataset until the entire frequency matrix is populated. The prevalences for each position form a discrete distribution over the four nucleotides, A, C, G and T. Simulating sequences with draws weighted by the prevalences contained in this frequency matrix yields a dataset with a star-like phylogeny. This simulation procedure will be referred to as the star-like simulation approach in the rest of this document. Unless specified otherwise, the target sequence is taken to be ACGT repeated until a sequence of the desired length is obtained.
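The star-like simulation approach can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the original simulation code: the function names are hypothetical, the frequency matrix is stored as a list of per-position dictionaries rather than as a four-row matrix, and the number of non-dominant nucleotides set to zero is taken to be the integer drawn between zero and two.

```python
import random

NUCLEOTIDES = "ACGT"

def build_frequency_column(dominant, dom, rng):
    """Draw the prevalences of A, C, G and T for a single position."""
    dom_i = rng.uniform(dom, 1.0)          # prevalence of dominant base
    others = [b for b in NUCLEOTIDES if b != dominant]
    z_i = rng.randint(0, 2)                # how many others are zeroed
    zeroed = set(rng.sample(others, z_i))
    share = (1.0 - dom_i) / (3 - z_i)      # equal split of remaining mass
    column = {b: (0.0 if b in zeroed else share) for b in others}
    column[dominant] = dom_i
    return column

def build_frequency_matrix(target, dom, rng):
    """One independently drawn column per position of the target sequence."""
    return [build_frequency_column(base, dom, rng) for base in target]

def simulate_star_like(target, dom, n_sequences, rng):
    """Simulate n_sequences sequences from a single frequency matrix."""
    matrix = build_frequency_matrix(target, dom, rng)
    sequences = []
    for _ in range(n_sequences):
        sequence = "".join(
            rng.choices(NUCLEOTIDES, weights=[col[b] for b in NUCLEOTIDES])[0]
            for col in matrix
        )
        sequences.append(sequence)
    return matrix, sequences
```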

As an example of the simulation process of the star-like approach, consider the simulation of a dataset with sequences of length two and a dominance parameter of 0.9 and target sequence TA. For the first position, four random draws are performed:

1) A number between 0.9 and 1 is drawn, say 0.92;

2) The dominant nucleotide is read from the target sequence, T;

3) Another number between 0 and 2 is drawn, say 1; and

4) Another letter (that does not match the first) is drawn, say a G.

The discrete distribution for the first position will then be: zero chance of drawing a G, a 0.92 chance of drawing a T, a 0.04 chance of drawing an A and a 0.04 chance of drawing a C. Independently from the first position, another four random draws are performed for the second position:

1) A number between 0.9 and 1 is drawn, say 0.99;

2) The dominant nucleotide is read from the target sequence, A;

3) Another number between 0 and 2 is drawn, say 2; and

4) Two letters (that do not match the first) are drawn, say G and T.

The discrete distribution for the second position will then be: 0% chance of drawing a G or a T, 99% chance of drawing an A and a 1% chance of drawing a C. Using these two discrete distributions, a dataset of sequences can easily be generated.
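The arithmetic of this worked example can be reproduced with a small deterministic Python helper, in which the random draws are replaced by the values chosen in the example. The function name `frequency_column` is hypothetical. With T dominant at 0.92 and G set to zero, the remaining mass of 0.08 is split equally between A and C; with A dominant at 0.99 and both G and T set to zero, the remaining mass of 0.01 goes to C.

```python
def frequency_column(dominant, dom_i, zeroed):
    """Prevalences at one position, given the outcomes of the random draws."""
    others = [b for b in "ACGT" if b != dominant]
    # The remaining mass is shared equally by the non-zeroed nucleotides.
    share = (1.0 - dom_i) / (len(others) - len(zeroed))
    column = {b: (0.0 if b in zeroed else share) for b in others}
    column[dominant] = dom_i
    return column

# First position: dominant T, prevalence 0.92, G set to zero.
first = frequency_column("T", 0.92, {"G"})
# Second position: dominant A, prevalence 0.99, G and T set to zero.
second = frequency_column("A", 0.99, {"G", "T"})
```

Up to floating point rounding, `first` assigns 0.04 to each of A and C, and `second` assigns 0.01 to C.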

on average. To convert this number to an average pairwise Hamming distance, multiply it by the length of the sequence. We saw above that in the recursive simulation approach, two parameters, the tree depth and the number of mutations per generation, together strongly influence the eventual average
