1.7 Methods used to study HCV

1.7.4 Next-generation sequencing (NGS)

Next-generation sequencing has dramatically changed the possibilities for viral population analysis within infected hosts. NGS can be used to examine HCV strains present as majority and minority variants and is far more reliable for producing whole genome coverage than classical sequencing methods (Sanger sequencing).

Because NGS generates millions of sequence reads in a single run, a metagenomic approach can be used to identify whole viral genomes. As the technique is unselected (it does not require specific PCR primers), it can also be used to discover new viruses. NGS has recently been used to identify the aetiology of Merkel cell cancer (Feng et al., 2008) and to identify a novel bunyavirus in a patient with “severe fever with thrombocytopenia syndrome” (Xu et al., 2011).

This metagenomic approach recently aided the characterisation of the virome in children with acute diarrhea and fever (Wylie et al., 2012). While this technique is powerful, it is inefficient, as the majority of sequence reads produced are not viral in origin. It can be improved, as described later in this thesis, by the use of target enrichment (magnetic beads attached to virus-specific oligonucleotides) to purify the virus of interest.

This thesis describes the use of a metagenomic and target enrichment approach for sequencing HCV (and HIV in co-infected patients).

1.7.4.1 NGS using the Illumina platform

The Illumina platform is currently used more widely than other available NGS platforms due to efficiency and cost. The principle is based on a sequencing-by-synthesis approach, meaning that the four nucleotides are added with DNA

The main error type on the Illumina platform is substitution, in which an incorrect nucleotide is identified. This occurs either when the de-blocking step is performed incompletely, causing a cluster to fall out of phase, or when incomplete cleavage of the fluorescent label before subsequent cycles introduces noise into the signal (Mardis, 2013). However, Illumina currently produces data of higher quality than other platforms, with low error rates, making it the first choice for many genome-sequencing projects (Zhang et al., 2011).

Next-generation sequencing creates a library of millions of DNA fragments.

These DNA fragments may be read from both ends; this is called paired-end sequencing and allows longer fragments to be covered. It also improves the analysis of NGS data by providing overlapping regions that are sequenced twice. Alignment programs that map sequences of interest to reference genomes take into account the length of the synthesized DNA fragments in the sequence library to reach the most accurate alignments (Korbel et al., 2007).
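As a minimal sketch of how fragment length can inform paired-end alignment, the Python function below checks whether two mates map at a distance consistent with the library. The function name, the coordinate convention, and the default mean fragment length and tolerance are illustrative assumptions and are not taken from any particular aligner.

```python
def is_concordant_pair(fwd_start, rev_end, mean_fragment=300, tolerance=100):
    """Return True if the implied fragment length fits the library.

    fwd_start: 0-based leftmost reference position of the forward mate.
    rev_end:   0-based exclusive end position of the reverse mate.
    mean_fragment and tolerance are hypothetical library parameters.
    """
    fragment_length = rev_end - fwd_start
    if fragment_length <= 0:
        # Mates are in the wrong order or orientation.
        return False
    return abs(fragment_length - mean_fragment) <= tolerance
```

A pair implying a fragment far longer or shorter than the library allows would be treated as discordant, which is one way an aligner can choose between several candidate positions for a read.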

1.7.4.2 Errors and limitations of NGS

NGS requires a small number of non-specific PCR steps, and this may result in PCR-based error (Poh et al., 2013). Using high-fidelity polymerase enzymes and limiting the number of PCR cycles reduces this error to below that seen with traditional PCR-based methods. Because NGS is a highly sensitive technique, it is also highly prone to cross-contamination error.

Sequence reads generated on the Illumina platform tend to be shorter than those generated using other methods, e.g. Sanger sequencing. This is because the signal-to-noise ratio limits the NGS read length (Mardis, 2013). Sequence read length is increasing over time with advances in Illumina-based technology.

Control samples may be included in each run to assess error rates (Hillier et al., 2008b). These can be used to give a detailed picture of 1) the type of error that has occurred within each sequence, for instance deletion, insertion or substitution; 2) the kind of error model that should be used to correct sequence reads; and 3) regions missing from the sequenced sample (Mardis, 2013).
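The error classification described above can be sketched in Python as follows, assuming that each control read has already been aligned to the known control sequence and that gaps are written as "-". The function names and the gapped-string input format are assumptions made for illustration only.

```python
from collections import Counter


def classify_errors(aligned_ref, aligned_read):
    """Count matches, substitutions, insertions and deletions.

    Both arguments are gapped alignment strings of equal length,
    with '-' marking a gap (an assumed input format).
    """
    if len(aligned_ref) != len(aligned_read):
        raise ValueError("aligned strings must have the same length")
    counts = Counter()
    for ref_base, read_base in zip(aligned_ref, aligned_read):
        if ref_base == "-" and read_base == "-":
            continue  # uninformative column
        if ref_base == "-":
            counts["insertion"] += 1  # extra base in the read
        elif read_base == "-":
            counts["deletion"] += 1  # base missing from the read
        elif ref_base != read_base:
            counts["substitution"] += 1
        else:
            counts["match"] += 1
    return counts


def error_rate(counts):
    """Fraction of aligned columns that are not matches."""
    total = sum(counts.values())
    return (total - counts["match"]) / total if total else 0.0
```

Summing these counts over all control reads in a run gives a per-type error profile that could then inform the choice of error model.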

Many tools have been designed to align large amounts of data from short fragment reads created by NGS instruments against a reference genome.

Defining variants or single nucleotide polymorphisms (SNPs), or spotting an over- or under-enriched region, can be achieved using a variety of different methods.
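One simple way to define majority and minority variants at a single genome position is to apply depth and frequency thresholds to the base counts from mapped reads. The sketch below illustrates this idea; the function name and both threshold values are assumptions chosen for the example, not values used in this thesis.

```python
def call_variants(base_counts, min_depth=100, min_frequency=0.01):
    """Return (majority_base, {minority_base: frequency}) for one position.

    base_counts maps each base to the number of reads supporting it.
    min_depth and min_frequency are hypothetical thresholds.
    """
    depth = sum(base_counts.values())
    if depth < min_depth:
        return None, {}  # too few reads to call anything reliably
    ranked = sorted(base_counts.items(), key=lambda item: item[1], reverse=True)
    majority = ranked[0][0]
    minority = {
        base: count / depth
        for base, count in ranked[1:]
        if count / depth >= min_frequency
    }
    return majority, minority
```

In practice the frequency threshold would need to sit above the platform error rate, otherwise sequencing errors would be reported as minority variants.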

1.7.4.3 Mapping

The most important step in NGS data analysis is the assembly and mapping of sequence reads to a reference sequence. Because NGS produces a huge amount of data, two fundamental issues need to be considered: the computational resources required to handle that volume of data, and the error profile of the reads. Traditional methods, such as Smith-Waterman dynamic programming, BLAT or BLAST, are computationally intensive and can require several days to map the data to a reference genome. To overcome these problems, new methods have been developed.
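To illustrate why dynamic programming is costly at NGS scale, the following is a minimal Python sketch of Smith-Waterman local alignment scoring with a linear gap penalty. The scoring values are arbitrary assumptions, and the function returns only the best local score, not the alignment itself.

```python
def smith_waterman_score(query, reference, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between two sequences.

    Uses two rows of the dynamic programming matrix at a time.
    The match, mismatch and gap values are illustrative assumptions.
    """
    cols = len(reference) + 1
    previous = [0] * cols
    best = 0
    for i in range(1, len(query) + 1):
        current = [0] * cols
        for j in range(1, cols):
            same = query[i - 1] == reference[j - 1]
            diagonal = previous[j - 1] + (match if same else mismatch)
            up = previous[j] + gap        # gap in the reference
            left = current[j - 1] + gap   # gap in the query
            # Local alignment never lets a cell drop below zero.
            current[j] = max(0, diagonal, up, left)
            best = max(best, current[j])
        previous = current
    return best
```

Every cell of a matrix whose size is the read length multiplied by the reference length is filled in, so the running time grows with the product of the two lengths. Repeating this for millions of reads is what makes the approach slow without indexing.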

Two bioinformatics researchers from Ohio State University initially introduced six parallel methods to overcome this problem, improving hash/index-based alignment of short sequences to reference genomes. The six methods are: dividing the reads, dividing the genome, dividing both reads and genome, suffix-based assignment (SBA), SBA after partitioning the reads, and SBA after partitioning the genome. Another method, CloudBurst, was introduced by Schatz et al. and is used to map single-end reads.
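The general idea behind hash/index-based short-read alignment can be sketched as a seed-and-verify procedure: index every k-mer of the reference once, look up the first k bases of each read, then verify the candidate positions. This Python sketch is a generic illustration with hypothetical function names and parameters; it is not the implementation of any of the methods or tools named in this section.

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Map every k-mer in the reference to its start positions."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def map_read(read, reference, index, k, max_mismatches=1):
    """Seed with the first k bases of the read, then verify each hit.

    Returns a list of (position, mismatches) tuples. Only substitutions
    are tolerated; indels are ignored in this simplified sketch.
    """
    hits = []
    if len(read) < k:
        return hits
    for pos in index.get(read[:k], []):
        window = reference[pos:pos + len(read)]
        if len(window) < len(read):
            continue  # read would run off the end of the reference
        mismatches = sum(1 for a, b in zip(read, window) if a != b)
        if mismatches <= max_mismatches:
            hits.append((pos, mismatches))
    return hits
```

For example, with the reference "ACGTACGTTAGC" and k = 4, the read "ACGTT" is seeded at positions 0 and 4 and verified at both. Because the index is built once and each lookup is a dictionary access, the cost per read no longer scales with the full reference length.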

BreakDancer introduced two further complementary algorithms (BreakDancerMax and BreakDancerMini), which allow analysis of more than one pool and more than one library in more than one sample. GNUMAP uses quality scores to obtain the most accurate analysis from fewer sequences (Bao et al., 2011). More recently, further software has been developed and released; examples are PASS (a program to align short sequences) (Campagna et al., 2009), SOAP (short oligonucleotide alignment program) (Li et al., 2008b), Bowtie, an ultrafast, memory-efficient short read aligner (Langmead et al., 2009), CloudBurst (Schatz, 2009), MAQ (Mapping and Assembly with Quality) (Li et al., 2008a), ZOOM (Zillions of oligos mapped) (Lin et al., 2008), SHRiMP (accurate mapping of short colour-space reads) (Rumble et al., 2009), and PerM (efficient mapping of short sequencing reads with periodic full sensitive spaced seeds) (Chen et al., 2009).

In this thesis, an in-house mapping programme called Tanoti (manuscript under review), designed by Dr Vattipally Sreenu, was used to overcome errors related to the presence of highly divergent viral genomes.
