Chapter 2: Materials and methods.
2.10 Bioinformatic methods 1 Genome sequencing
2.10.3 Read mapping
Reads were aligned to a reference sequence with Bowtie2 (Langmead and Salzberg, 2012). The reference genome for the ancestral phage was retrieved from GenBank (AF176034.1) in FASTA format and indexed.
bowtie2-build ref_genome.FASTA phix_ref
Following this, each FASTQ file of merged reads, and each pair of FASTQ files of unmerged reads, was aligned against the indexed genome to produce a SAM (sequence alignment/map) file. This file contains all the data from the original FASTQ file in addition to the mapping data.
bowtie2 -x phix_ref -U merged_reads.fastq -S mapped_singles.SAM
bowtie2 -x phix_ref -1 sickle_fwd.fastq -2 sickled_rev.fastq \
    -S mapped_pairs.SAM
SAMtools (Li et al., 2009) was used to merge the SAM files generated from merged and paired reads, and convert the SAM files to the compressed binary BAM format, which takes up less space and can be processed more quickly. SAMtools was also used to sort and index the BAM files.
samtools merge mapped.SAM mapped_singles.SAM mapped_pairs.SAM
samtools view -bS mapped.SAM > mapped.BAM
samtools view -bS mapped.BAM | samtools sort - sorted.BAM
samtools index sorted.BAM
The LeftAlignIndels function of the Genome Analysis Toolkit (McKenna et al., 2010) was run on each BAM file. When indels appear in a sequence, they can often be aligned in multiple equivalent configurations (figure 5.1). Aligning every indel to the leftmost possible position standardises downstream processing and ensures that indels are not mistaken for substitutions.
java -jar GenomeAnalysisTK.jar -R ref_genome.fasta \
    -T LeftAlignIndels -I sorted.BAM -o left.BAM
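The idea behind left alignment can be illustrated with a minimal Python sketch. This is not the GATK implementation; the function names and the toy reference below are hypothetical, and only simple deletions are handled. A deletion can be moved one base to the left whenever the base entering the deleted window matches the base leaving it.

```python
def left_align_deletion(reference: str, start: int, length: int) -> int:
    """Return the leftmost equivalent 0-based start of a deletion.

    The deletion removes reference[start:start + length].
    """
    # Shifting the window one base left leaves the resulting sequence
    # unchanged when the base entering the window equals the base leaving it.
    while start > 0 and reference[start - 1] == reference[start + length - 1]:
        start -= 1
    return start


def apply_deletion(reference: str, start: int, length: int) -> str:
    """Return the sequence that remains after the deletion is applied."""
    return reference[:start] + reference[start + length:]


# Toy reference (hypothetical) containing a CA repeat.
toy_reference = "GGCACACATT"
reported_start = 6  # deletion of "CA" reported at the right end of the repeat
leftmost_start = left_align_deletion(toy_reference, reported_start, 2)

# Both placements describe the same mutated sequence.
same_result = (
    apply_deletion(toy_reference, reported_start, 2)
    == apply_deletion(toy_reference, leftmost_start, 2)
)
print(leftmost_start, same_result)
```

Both placements yield the same sequence, which is why an unnormalised caller could report the same event at several positions.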
Although ΦX174 has a circular genome, the mapping algorithms treat the reference genome as linear, meaning there is a break in the sequence (between positions 5386 and 1). A read that spans this break cannot map to the reference genome accurately, and an indel may be called erroneously. Since reads are short, the region of the chromosome affected is small: assuming a maximum read length of 250 bp, a read spanning the break can extend at most 249 bp to either side, so 498 positions could be affected. In bacterial chromosomes, which are typically several million bp in length (Wang et al., 2013), this is a relatively minor problem because such a small fraction of the total genome is affected. However, the genome of ΦX174 is only 5386 bp long, meaning that over 9% of the total genome falls within this region.
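The arithmetic above can be checked with a short Python sketch. The function name is hypothetical, and the 5,000,000 bp bacterial genome is an illustrative figure rather than a value taken from the text.

```python
def positions_near_break(max_read_length: int) -> int:
    """Number of positions that a break-spanning read could cover."""
    # A spanning read has at least one base on each side of the break,
    # so it reaches at most (max_read_length - 1) positions on either side.
    return 2 * (max_read_length - 1)


phix_length = 5386            # genome length stated in the text
bacterial_length = 5_000_000  # illustrative assumption only

affected = positions_near_break(250)
phix_fraction = affected / phix_length
bacterial_fraction = affected / bacterial_length

print(affected)
print(f"{phix_fraction:.2%} of the phage genome")
print(f"{bacterial_fraction:.4%} of the illustrative bacterial genome")
```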
To account for this, a FASTA file was created that spanned the break, running from positions 5041-5386 and 1-350 of the original reference genome. This was indexed with Bowtie2 as before, and the FASTQ files were mapped against it. SAMtools was again used to convert to BAM, sort and index, and indels were left-aligned as before.
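One way such a break-spanning reference could be assembled is sketched below in Python. The function names, the record name and the placeholder sequence are assumptions for illustration; the original file may have been prepared differently.

```python
def build_break_spanning_sequence(genome: str, tail_start: int, head_end: int) -> str:
    """Join the end of a circular genome to its start.

    tail_start and head_end are 1-based, inclusive genome positions.
    """
    tail = genome[tail_start - 1:]  # positions tail_start..end of genome
    head = genome[:head_end]        # positions 1..head_end
    return tail + head


def to_fasta(name: str, sequence: str, width: int = 70) -> str:
    """Format a sequence as a single FASTA record with wrapped lines."""
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return f">{name}\n" + "\n".join(lines) + "\n"


# Placeholder sequence of the right length; the real genome would be read
# from the reference FASTA file.
placeholder_genome = "N" * 5386
spanning = build_break_spanning_sequence(placeholder_genome, 5041, 350)
print(len(spanning))  # 346 bases from the end plus 350 from the start
record = to_fasta("phix_break_region", spanning)  # hypothetical record name
```

With the coordinates given in the text, the resulting reference is 696 bp long, so a 250 bp read crossing the original break maps well inside it.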
2.10.4 Variant calling
Quality scores for each base are given in Phred format, which is logarithmically related to the probability of the base call being erroneous. Phred score Q is given by the equation:
Q = -10 log₁₀ P
where P is the probability of an incorrect base call. A Phred score of 10 corresponds to a 90% base call accuracy, while 20 is 99%, 30 is 99.9%, and so on. Although Q = 30 usually gives sufficient confidence, a number of miscalled bases are inevitable when sequencing at high coverage. For example, if read depth at a particular nucleotide is 1000, and Q is uniformly 30 at that position, then on average one read is expected to be miscalled at that position.
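The relationship between Q, P and read depth can be expressed as a short Python sketch. The function names are hypothetical; the formula is the one given above. At a depth of 1000 and Q = 30, the expected number of miscalls is 1, and the chance of at least one miscall is roughly 63%.

```python
import math


def phred_to_error(q: float) -> float:
    """Error probability P for a Phred score Q."""
    return 10 ** (-q / 10)


def error_to_phred(p: float) -> float:
    """Phred score Q for an error probability P."""
    return -10 * math.log10(p)


def expected_miscalls(depth: int, q: float) -> float:
    """Expected number of miscalled reads at one position."""
    return depth * phred_to_error(q)


def prob_at_least_one_miscall(depth: int, q: float) -> float:
    """Probability that at least one read is miscalled at one position."""
    return 1 - (1 - phred_to_error(q)) ** depth


for q in (10, 20, 30):
    print(q, f"{1 - phred_to_error(q):.1%} accuracy")

print(expected_miscalls(1000, 30))
print(prob_at_least_one_miscall(1000, 30))
```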
Since the majority of DNA fragments in this experiment were shorter than the maximum read length, most read pairs fully or partially overlapped and were merged with PEAR. Because the quality scores of the two reads are calculated independently, a base that is identical on both reads can be given an updated quality score by multiplying the two error probabilities together, which is equivalent to adding the two Phred scores (figure 2.10). For example, a base with Q = 30 has a 0.1% chance of being an error. If the same base is present on the other read with the same Q score, the probability of it being miscalled on both reads is 0.0001%, i.e. Q = 60.
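The worked example can be reproduced with the following Python sketch. It implements only the simplified independence model described in this section, not PEAR's actual scoring scheme, and the function names and the optional cap are assumptions for illustration.

```python
import math


def merged_phred(q_forward: float, q_reverse: float) -> float:
    """Combined Phred score for a base that agrees on both reads.

    Assumes the two calls are independent, so the probability that both
    are wrong is the product of the individual error probabilities.
    """
    p_both_wrong = 10 ** (-q_forward / 10) * 10 ** (-q_reverse / 10)
    return -10 * math.log10(p_both_wrong)


def capped_phred(q: float, cap: float = 40) -> float:
    """Apply an upper limit to a reported score (cap value is an assumption)."""
    return min(q, cap)


combined = merged_phred(30, 30)
print(round(combined))         # multiplying probabilities adds the scores
print(capped_phred(combined))  # value that would be reported under a cap of 40
```

Because multiplying probabilities corresponds to adding logarithms, the combined score is simply the sum of the two input scores.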
To determine the mutations present in a sample, only positions with a Q of 40 or higher were considered. This allows even low-frequency mutations to be identified with high certainty at the expense of coverage, since non-overlapping portions of reads would likely be removed by the filter. It should be noted that while a Q of 40 means that each base has a 0.01% chance of being miscalled, 40 is the maximum Phred score that the FASTQ encoding supports. In practice, most bases that meet this quality requirement would have a much higher true score because they combine two quality scores.
Freebayes (Garrison and Marth, 2012) was used to call variants. To call higher frequency mutations, minimum Q was set to 40 (-q 40), and only mutations with ≥ 10% frequency (≥ 1% for indels) were returned in the output (-F 0.1). For each quality setting, two sets of VCF files were generated per sample, returning either SNPs or indels (using the argument -i to ignore indels or -I to ignore SNPs). The following arguments were also used: -X (ignore multi-nucleotide polymorphisms), -u (ignore complex events), -K (output all alleles which pass input filters), -J (assume that samples result from pooled sequencing), and -p 1 (ploidy set to 1).
freebayes -f ref_genome.fasta -q 30 -m 20 -F 0.1 \
    -X -i -u -K -J -p 1 left.BAM > snps.vcf
VCF files were generated in this way for both the BAM file mapped against the reference genome and the BAM file mapped against the region spanning the break of the circular phage genome. The Python script OriginPositionFixer.py (appendix X) was used to renumber the latter with genome positions corresponding to the original reference genome. The Python script OriginMerger.py (appendix X) then parsed the main VCF file. At each position, it checked whether that position was present in the second VCF file, compared the coverage, selected the line with the highest depth, and wrote that line to a new file. The output was a merged VCF file with high coverage at the beginning and end of the genome.
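The renumbering and merging logic can be sketched in Python as follows. This is not the code of OriginPositionFixer.py or OriginMerger.py, which are given in the appendix; the function names and data layout are hypothetical, and the coordinates assume the break-spanning reference runs from positions 5041-5386 followed by 1-350.

```python
TAIL_START = 5041     # first original position in the break-spanning reference
GENOME_LENGTH = 5386  # length of the original reference genome
TAIL_LENGTH = GENOME_LENGTH - TAIL_START + 1  # bases taken from the genome end


def to_original_position(span_position: int) -> int:
    """Convert a 1-based position in the break-spanning reference
    to the matching 1-based position in the original reference."""
    if span_position <= TAIL_LENGTH:
        return TAIL_START + span_position - 1
    return span_position - TAIL_LENGTH


def merge_by_depth(main_records: dict, span_records: dict) -> dict:
    """For each position in the main set, keep whichever record is deeper.

    Both arguments map an original-genome position to (depth, vcf_line).
    """
    merged = {}
    for position, (depth, line) in main_records.items():
        other = span_records.get(position)
        if other is not None and other[0] > depth:
            merged[position] = other
        else:
            merged[position] = (depth, line)
    return merged


# Made-up depths and line labels, for illustration only.
main = {1: (10, "main line, pos 1"), 2000: (900, "main line, pos 2000")}
span = {to_original_position(347): (800, "span line, pos 1")}
merged = merge_by_depth(main, span)
print(merged[1][1], "|", merged[2000][1])
```

In this sketch, position 1 is taken from the break-spanning alignment because its depth is higher there, while position 2000 keeps the record from the main alignment.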
Finally, the Python script VCFsimplifier.py (appendix X) was used to parse each VCF file and return a list of alleles and their frequencies in a more readable format.
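As an illustration of this kind of simplification, the Python sketch below derives allele frequencies from one VCF line. It is not VCFsimplifier.py; it assumes that frequency is taken as the alternate observation count (AO) divided by the read depth (DP) in the INFO column, and the example record is made up.

```python
def allele_frequencies(vcf_line: str) -> list:
    """Return (position, ref, alt, frequency) for each alternate allele.

    Assumes the INFO column carries DP (read depth) and AO (alternate
    observation counts, comma-separated when there are several alleles).
    """
    fields = vcf_line.rstrip("\n").split("\t")
    position, ref, alt, info = fields[1], fields[3], fields[4], fields[7]
    info_map = dict(
        item.split("=", 1) for item in info.split(";") if "=" in item
    )
    depth = int(info_map["DP"])
    counts = [int(value) for value in info_map["AO"].split(",")]
    return [
        (int(position), ref, allele, count / depth)
        for allele, count in zip(alt.split(","), counts)
    ]


# Made-up record for illustration; not data from this study.
example_line = "AF176034.1\t1301\t.\tG\tA\t500.0\t.\tAO=250;DP=1000;RO=750"
for position, ref, allele, frequency in allele_frequencies(example_line):
    print(position, f"{ref}>{allele}", f"{frequency:.1%}")
```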
Figure 2.2 - FastQC output before (top) and after (bottom) reads were merged with PEAR. Overlapping regions from paired-end reads were sequenced twice, meaning that the error probabilities were multiplied together (i.e. the Phred scores were added) if the sequenced base at each position agreed. 40 is the maximum Phred value that FASTQ encoding supports, but most true quality scores will have been much higher. These data were from line A1, passage 10, but are representative of all samples.
Table 2.2 – DNA sequences for indices, primers and adaptors from the NEBNext DNA Library Prep Kit
ID Sequence Index 1 CAAGCAGAAGACGGCATACGAGATCGTGATGTGACTGGAGTTCAGACGTGTGCTCTTCCGATC*T Index 2 CAAGCAGAAGACGGCATACGAGATACATCGGTGACTGGAGTTCAGACGTGTGCTCTTCCGATC*T Index 3 CAAGCAGAAGACGGCATACGAGATGCCTAAGTGACTGGAGTTCAGACGTGTGCTCTTCCGATC*T Index 4 CAAGCAGAAGACGGCATACGAGATTGGTCAGTGACTGGAGTTCAGACGTGTGCTCTTCCGATC*T Index 5 CAAGCAGAAGACGGCATACGAGATCACTGTGTGACTGGAGTTCAGACGTGTGCTCTTCCGATC*T Index 6 CAAGCAGAAGACGGCATACGAGATATTGGCGTGACTGGAGTTCAGACGTGTGCTCTTCCGATC*T Index 7 CAAGCAGAAGACGGCATACGAGATGATCTGGTGACTGGAGTTCAGACGTGTGCTCTTCCGATC*T Index 8 CAAGCAGAAGACGGCATACGAGATTCAAGTGTGACTGGAGTTCAGACGTGTGCTCTTCCGATC*T Index 9 CAAGCAGAAGACGGCATACGAGATCTGATCGTGACTGGAGTTCAGACGTGTGCTCTTCCGATC*T Index 10 CAAGCAGAAGACGGCATACGAGATAAGCTAGTGACTGGAGTTCAGACGTGTGCTCTTCCGATC*T Index 11 CAAGCAGAAGACGGCATACGAGATGTAGCCGTGACTGGAGTTCAGACGTGTGCTCTTCCGATC*T Index 12 CAAGCAGAAGACGGCATACGAGATTACAAGGTGACTGGAGTTCAGACGTGTGCTCTTCCGATC*T Universal primer AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATC
Adaptor 1 GATCGGAAGAGCACACGTCTGAACTCCAGTC Adaptor 2 ACACTCTTTCCCTACACGACGCTCTTCCGATC