
Chapter 2 General Materials and Methods

2.3 Phylogenetic and bioinformatics analyses

2.3.1 Phylogenetic reconstruction of PR-RT sequences

The phylogenies of intra-patient viral populations were estimated using a maximum likelihood (ML) approach with the program raxmlGUI version 1.3 [213]. Given a sequence alignment and an explicit model of nucleotide substitution, the ML method scores candidate trees by their likelihood, i.e. the probability of observing the alignment given a particular tree and the substitution model [214, 215]. The tree with the greatest likelihood among those evaluated is selected as the best estimate of the phylogeny [216]. The statistical robustness of the trees was evaluated by bootstrap analysis with 1,000 replicates. The phylogenetic trees were visualized and edited using FigTree version 1.4 (http://tree.bio.ed.ac.uk/software/figtree/).
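The resampling step that underlies bootstrap analysis can be illustrated with a minimal sketch. It is not the raxmlGUI implementation: the function names and the toy alignment are hypothetical, and it only shows how alignment columns are drawn with replacement to build each replicate data set, from which a tree would then be re-estimated.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Return one bootstrap replicate: columns resampled with replacement."""
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same aligned length")
    length = lengths.pop()
    # Draw as many column indices as the alignment has columns.
    columns = [rng.randrange(length) for _ in range(length)]
    return {
        name: "".join(seq[i] for i in columns)
        for name, seq in alignment.items()
    }


def bootstrap_replicates(alignment, n_replicates=1000, seed=0):
    """Generate n_replicates resampled alignments (seeded for repeatability)."""
    rng = random.Random(seed)
    return [bootstrap_alignment(alignment, rng) for _ in range(n_replicates)]


# Hypothetical toy alignment, for illustration only.
toy = {"seq1": "ACGTACGT", "seq2": "ACGTACGA", "seq3": "ACCTACGA"}
replicates = bootstrap_replicates(toy, n_replicates=5, seed=1)
```

Each replicate keeps the sequence names and alignment length but changes how often each original column is represented. Bootstrap support for a clade is the fraction of replicate trees in which that clade is recovered.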

2.3.1.1 Inter-patient viral evolution

A multiple sequence alignment (MSA) was generated from all single-genome-derived PR-RT sequences from each sampling time point of each patient and was used to reconstruct an ML phylogenetic tree under the general time-reversible (GTR) model of nucleotide substitution with 1,000 bootstrap replicates. The combined ML tree was visualized in FigTree 1.4 and midpoint rooted to check for contamination between patient samples or from external genetic sources such as the HIV-based plasmids used in the laboratory.

2.3.1.2 Intra-patient viral evolution

For each patient, a multiple sequence alignment was generated from the single-genome-derived PR-RT sequences obtained at every available sampling time point and imported into raxmlGUI version 1.3 to construct an ML phylogenetic tree under the GTR model of nucleotide substitution with 1,000 bootstrap replicates. Each tree was rooted against an outgroup, an HIV-1 subtype C PR-RT sequence from the test sample TS5, which had been used to troubleshoot the PCR and SGS protocols. To indicate the direction of evolution within the ML trees, I required the outgroup to be an HIV-1 subtype C PR-RT sequence from an ART-naïve patient. This sequence was related closely enough to the patient sequences to be basal to the rest of the sequences in each tree, but not so closely that it grouped with them.

2.3.1.3 Mean pairwise genetic distance to measure population diversity

I used the Molecular Evolutionary Genetics Analysis (MEGA) software version 5.2 [217] to calculate the mean pairwise genetic distances (MPDs) of PR-RT sequences within and between sampling time points in each child. I used the Tamura-Nei (1993) nucleotide substitution model, which MEGA identified as the best-fitting model for these data (lowest Bayesian Information Criterion score, taken to indicate the highest posterior probability). Regression analysis was used to determine whether the number of sequences obtained at each time point affected the estimated genetic distances. I tested whether differences in PR-RT MPD between consecutive time points were significant using an unpaired two-tailed Student’s t-test. MPD was expressed as the mean number of nucleotide substitutions per site across all PR-RT sequences.
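The idea of a mean pairwise distance within and between time points can be sketched as follows. This is an illustration only: it substitutes the uncorrected p-distance (the proportion of differing sites) for the Tamura-Nei model used in MEGA, and the function names and toy sequences are hypothetical.

```python
from itertools import combinations, product
from statistics import mean


def p_distance(a, b):
    """Proportion of differing sites; positions with a gap or ambiguity are skipped."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = differences = 0
    for x, y in zip(a.upper(), b.upper()):
        if x in "ACGT" and y in "ACGT":
            compared += 1
            differences += x != y
    if compared == 0:
        raise ValueError("no comparable sites")
    return differences / compared


def mean_pairwise_distance(seqs):
    """Mean distance over all pairs within one group (needs at least two sequences)."""
    return mean(p_distance(a, b) for a, b in combinations(seqs, 2))


def mean_between_group_distance(group_1, group_2):
    """Mean distance over all pairs formed by one sequence from each group."""
    return mean(p_distance(a, b) for a, b in product(group_1, group_2))


# Hypothetical toy sequences for two sampling time points.
timepoint_1 = ["ACGTACGTAC", "ACGTACGTAC", "ACGTACGTAT"]
timepoint_2 = ["ACGTACGAAT", "ACGTTCGAAT", "ACGTACGAAC"]

within_1 = mean_pairwise_distance(timepoint_1)
between = mean_between_group_distance(timepoint_1, timepoint_2)
```

A model-corrected distance such as Tamura-Nei would replace `p_distance` here. The averaging over pairs stays the same.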

2.3.2 Assessment of recombination in PR-RT

Evidence of recombination between PR-RT sequences derived from single genomes over time was determined using the Single Breakpoint Analysis (SBP) and Genetic Algorithm Recombination Detection (GARD) from the online Datamonkey software package at http://www.datamonkey.org. Significant breakpoints were reported for P values <0.05.

2.3.3 Intra-patient analysis of selection pressure on the HIV-1 PR and RT genes

Intra-patient selective pressures on HIV-1 PR-RT and gag-PR-RT were determined with the Datamonkey software package. For positive selection analyses, the ratio of the rate of non-synonymous substitutions per non-synonymous site (dN) to the rate of synonymous substitutions per synonymous site (dS), dN/dS or ω, was calculated using three different algorithms. A dN/dS of 1 suggests neutral evolution. A dN/dS <1 suggests negative (purifying) selection, because the synonymous rate exceeds the non-synonymous rate, indicating that non-synonymous changes at that site are removed from the population. A dN/dS >1 suggests positive selection, because the non-synonymous rate exceeds the synonymous rate (Vandamme et al., 200λ). The three algorithms used to determine ω within each patient's viral population were FEL (Fixed Effects Likelihood), SLAC (Single Likelihood Ancestor Counting) and FUBAR (Fast, Unconstrained Bayesian AppRoximation).

All three methods estimate dN/dS for each codon in a given sequence alignment. FEL directly estimates non-synonymous and synonymous substitution rates at each site [218]. SLAC estimates the number of non-synonymous and synonymous substitutions that occurred at each codon in an alignment by reconstructing the most likely ancestral sequences and counting substitutions using a weighting scheme [219]. FUBAR detects selection under a model that allows substitution rates to vary from site to site and estimates the posterior distribution of synonymous (α) and non-synonymous (β) substitution rates [220].
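The distinction between synonymous and non-synonymous changes that these methods rely on can be shown with a deliberately naive sketch. It is not SLAC, FEL or FUBAR: it performs no ancestral reconstruction, weighting or normalisation by the number of synonymous and non-synonymous sites, so it yields raw counts rather than dN/dS. The function name and example sequences are hypothetical, and the standard genetic code is assumed.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def count_codon_changes(ancestor, descendant):
    """Count codons that changed synonymously vs non-synonymously.

    Both inputs are in-frame, aligned nucleotide sequences. Codons containing
    gaps or ambiguous bases are skipped.
    """
    if len(ancestor) != len(descendant) or len(ancestor) % 3 != 0:
        raise ValueError("sequences must be aligned, in frame and equal in length")
    synonymous = non_synonymous = 0
    for i in range(0, len(ancestor), 3):
        a = ancestor[i:i + 3].upper()
        d = descendant[i:i + 3].upper()
        if a == d or a not in CODON_TABLE or d not in CODON_TABLE:
            continue
        if CODON_TABLE[a] == CODON_TABLE[d]:
            synonymous += 1
        else:
            non_synonymous += 1
    return synonymous, non_synonymous


# AAA -> AGA changes Lys to Arg (non-synonymous); CTG -> CTA keeps Leu (synonymous).
counts = count_codon_changes("ATGAAACTG", "ATGAGACTA")
```

Turning such counts into rates requires dividing by the expected numbers of synonymous and non-synonymous sites, which is where the three algorithms differ.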

Multiple sequence alignments of PR-RT accompanied by corresponding ML-trees generated from raxmlGUI version 1.3 were uploaded to the online Datamonkey platform. Substitution models were determined using the automatic substitution model selection tool, which selected the HKY85 model for PR-RT sequence alignments. SLAC, FEL, and FUBAR algorithms were then run with the selected substitution model with confidence intervals of 1.0 and a significance level of <0.05 (SLAC and FEL) or a posterior probability of 0.95 (FUBAR).

2.3.4 Co-evolution analysis

I also used the Spidermonkey algorithm [221] at http://www.datamonkey.org to determine whether sites identified as positively selected by SLAC, FEL or FUBAR co-evolved, i.e. whether the evolution of amino acids at any pair of positively selected sites was interdependent during the course of protein evolution.

2.3.5 Position Specific Scoring Matrix

I used the Perl script “aa_freq.pl”, developed by Professor Simon Watson (unpublished), to produce a position-specific scoring matrix (PSSM) from an MSA of the 891 HIV-1 subtype C Gag amino acid sequences from treatment-naïve children from Sub-Saharan Africa that were available in the Los Alamos HIV Sequence Database on June 1st 2014. I used the PSSM to determine the natural variation of amino acids at key positions in Gag for patients from my study cohort using population sequence analysis.
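Because “aa_freq.pl” is unpublished, its exact output is not known here. The sketch below shows one plausible reading of the step, a per-column table of amino acid frequencies computed from an aligned set of protein sequences. The function name, the treatment of gaps and the toy alignment are all assumptions, and no log-odds scoring against background frequencies is applied.

```python
from collections import Counter


def amino_acid_frequencies(aligned_proteins):
    """Return, for each alignment column, a dict of residue -> frequency.

    Gap characters ("-") are excluded from both counts and totals, so each
    column's frequencies sum to 1 unless the column contains only gaps.
    """
    lengths = {len(seq) for seq in aligned_proteins}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same aligned length")
    matrix = []
    for column in zip(*aligned_proteins):
        residues = [r.upper() for r in column if r != "-"]
        counts = Counter(residues)
        total = len(residues)
        matrix.append({aa: n / total for aa, n in counts.items()} if total else {})
    return matrix


# Hypothetical toy alignment of four short peptides.
toy_alignment = ["MGAR", "MGAK", "MGSR", "M-AR"]
frequencies = amino_acid_frequencies(toy_alignment)
```

Looking up a position in `frequencies` gives the spread of residues observed there in the reference set, against which a residue seen in a study sequence can be compared.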

In document Population and single genome kinetics driving the evolution of multiple linked multiclass drug resistance mutations in the viral protease and reverse transcriptase of HIV-1 subtype C in children receiving early protease inhibitor based combination therapy (Page 89-93)