Part II. The immunoglobulin repertoire response to Anthrax Vaccine Adsorbed
3.1 Unique molecular identifiers correct nucleotide errors from sequencing
Nucleotide errors may be introduced during library preparation and processing. We used KAPA Biosystems HiFi polymerase for library PCR amplification, which has a reported error rate of 2.77 × 10⁻⁷ (2017). Given a typical output of 20,000,000 raw reads from an Illumina MiSeq 2x300bp (v3) run, an average fragment length of 450bp, and up to 20 PCR cycles, we estimate that a hypothetical raw dataset would carry polymerase errors in 0.25% of its products (49,860 sequences). However, a much smaller proportion of reads pass quality filtering, joining and immunoglobulin assignment. Our complete AVA project, with 23 sequenced libraries, yielded 3,513,038 immunoglobulin sequences with a mean length of 314bp after pre-processing (joining; adaptor, multiplex index, UID and SMARTer oligo trimming). The viable reads may then be estimated to carry PCR errors in 0.17% of products, or 6,111 sequences.
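The PCR error estimates above can be reproduced with a short calculation. This is a minimal sketch, assuming the reported polymerase error rate applies per base per cycle and that errors accumulate linearly (reads × bases × cycles × rate); the function and variable names are illustrative and are not part of the original analysis.

```python
def expected_pcr_errors(n_reads, mean_length_bp, cycles, error_rate):
    """Expected number of polymerase errors under a linear model.

    Assumption: error_rate is per base per cycle, and every read
    passes through all cycles.
    """
    return n_reads * mean_length_bp * cycles * error_rate


KAPA_HIFI_ERROR_RATE = 2.77e-7  # reported rate quoted in the text
PCR_CYCLES = 20                 # upper bound quoted in the text

# Hypothetical raw MiSeq run: 20,000,000 reads, 450 bp fragments.
raw_errors = expected_pcr_errors(20_000_000, 450, PCR_CYCLES, KAPA_HIFI_ERROR_RATE)

# Viable AVA reads: 3,513,038 sequences, 314 bp mean length.
viable_errors = expected_pcr_errors(3_513_038, 314, PCR_CYCLES, KAPA_HIFI_ERROR_RATE)

# Fraction of sequences affected, assuming at most one error per sequence.
raw_fraction = raw_errors / 20_000_000
viable_fraction = viable_errors / 3_513_038

print(f"raw: {raw_errors:.0f} errors ({raw_fraction:.2%})")
print(f"viable: {viable_errors:.0f} errors ({viable_fraction:.2%})")
```

Treating each expected error as falling in a distinct sequence slightly overstates the fraction of affected products, which is acceptable for an order-of-magnitude estimate.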
Sequencing errors are also a serious concern. Illumina base calls are considered high quality at Q30 (quality score = 30), which corresponds to a 1 in 1,000 probability of an incorrect base call and is acceptable for many applications (2011). For transcriptomics, quality trimming of read ends, many overlapping reads and k-mer-based error correction can be sufficient to lower this rate to negligible levels (Schirmer et al., 2015). After pre-processing, our remaining sequences' quality scores are high, with an average value across all nucleotides of Q36. With the caveat that FLASH read joining outputs a FASTQ file in which each matched nucleotide is assigned the higher of the two quality scores, and any mismatch is assigned a value of 2, we can estimate the number of errors using the Phred quality score calculation:
$$Q = -10 \log_{10} P$$
Where Q is the quality score and P is the probability of an erroneous base call. We then estimate that there are ~277,085 sequencing-induced nucleotide errors in our dataset. While this is a small proportion of the total nucleotides called, the errors are not evenly distributed along the read length. Because of the way Illumina sequencing and read joining are performed, nucleotide errors concentrate toward the ends of reads and across much of read 2 (Figure S3.1). In the joined reads, variation in quality scores is increased in one region (Figure S3.2).
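The Phred relationship and the resulting error count can be checked numerically. The sketch below inverts the quality score definition to obtain P and applies the mean score of Q36 to every base, which is a simplification (the mean of Q is not the same as the mean of P across bases); the function names are illustrative.

```python
def phred_to_error_probability(q):
    """Invert Q = -10 * log10(P) to obtain P = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


def expected_sequencing_errors(n_sequences, mean_length_bp, mean_q):
    """Expected miscalled bases if every base carried the mean quality score."""
    total_bases = n_sequences * mean_length_bp
    return total_bases * phred_to_error_probability(mean_q)


p_q30 = phred_to_error_probability(30)  # 1 in 1,000
estimate = expected_sequencing_errors(3_513_038, 314, 36)

print(f"P(error) at Q30: {p_q30:.4f}")
print(f"Expected errors at mean Q36: {estimate:,.0f}")
```

Because low-quality bases contribute disproportionately to the true error count, a per-base sum of error probabilities would give a higher figure than this mean-based estimate.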
The region with the lowest mean nucleotide quality scores after processing lies near the effective joining region and extends into read 2, around nucleotides 310-390, which includes the CDR3. Prior to processing, quality drops distinctly near the end of each read as phasing and pre-phasing variation accumulates, and this drop is especially precipitous in read 2. During joining, it is these low-scoring regions of reads 1 and 2 that overlap and from which the joined nucleotide calls are made. Unfortunately, during our library sequencing a manufacturer kit production issue emerged and read 2 quality deteriorated, further worsening the problem (Figure S3.1). Therefore, despite the relatively high joined sequence mean quality score, we expect that sequencer-induced mutations may be
Using our UID-tagged data from the variation study, we examined the nucleotide variation introduced within UID groups. Among 212,001 IgVRG sequences, we identified 26,404 UIDs. Of these, 55% (14,560 UID-tag groups) consisted of a single read. Of the remaining 11,844 UID-tag groups, 76% (9,000) had no errors by alignment, indicating that in 24% of multi-read UID-tag groups at least one sequence is affected by nucleotide error. This may be due to a combination of true PCR and sequencing errors, as well as poor read 2 sequencing quality and the challenge of joining two low-quality regions.
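The grouping logic behind these counts can be illustrated as follows. This is a simplified sketch on toy data, not the pipeline used in the study: it assumes reads are available as (UID, sequence) pairs and treats exact string identity within a group as a stand-in for "no errors by alignment". The function name and the toy reads are hypothetical.

```python
from collections import defaultdict


def summarize_uid_groups(reads):
    """Count singleton, error-free and variable UID groups.

    reads: iterable of (uid, sequence) pairs.
    A multi-read group is called error-free when all of its
    sequences are identical strings (a proxy for alignment).
    """
    groups = defaultdict(list)
    for uid, seq in reads:
        groups[uid].append(seq)

    singletons = sum(1 for seqs in groups.values() if len(seqs) == 1)
    multi = [seqs for seqs in groups.values() if len(seqs) > 1]
    error_free = sum(1 for seqs in multi if len(set(seqs)) == 1)

    return {
        "uids": len(groups),
        "singletons": singletons,
        "multi_read": len(multi),
        "error_free": error_free,
        "with_variation": len(multi) - error_free,
    }


# Toy data: three UIDs, one singleton, one clean group, one variable group.
toy_reads = [
    ("AAAC", "ACGTACGT"),
    ("AAAC", "ACGTACGT"),
    ("GGTA", "ACGTACGT"),
    ("GGTA", "ACGTTCGT"),
    ("TTCA", "ACGTACGA"),
]
summary = summarize_uid_groups(toy_reads)
print(summary)
```

Singleton groups are counted separately because within-group variation cannot be assessed from a single read.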
To illustrate UID heterogeneity, UID-tagged Ig-seq data were processed without assigning UID consensus reads, and B cell clones were estimated using Cloanalyst (Figure 3.1). In a subset of UID-tagged groups, many nucleotide errors have been introduced, though members of a UID-tagged group are consistently most closely related to other members of the same group. However, large UID-tagged groups may also be error-free by alignment, as in Figure 3.1B. Another clone of interest was observed to have a very high mutation frequency relative to the inferred clonal ancestor (Figure 3.2). Though this clone is naturally diverse between UID-tag groups, we identified one read with a large insertion. This insertion is of unknown origin and does not appear in any other read in the clone. An error of this sort is of particular concern because insertions and deletions (indels) are a common target of interest in immunoglobulin research. For example, functional broadly neutralizing antibodies often have long CDR3s, and insertions are often the cause. Additionally, duplication insertions are common in immunoglobulin genes, so errors of this sort may be mistaken for genetic alleles in reference library generation. It is clear that until reagents and methods are improved, UID tags should be used in Ig-seq to mitigate this serious source of error.
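One way UID tags can mitigate such errors is by collapsing each UID group to a consensus read. The sketch below is an illustrative majority-vote consensus and is not the consensus method used in this work, which is not described in this section. It assumes reads in a group are already in register, and it sets aside reads whose length differs from the most common length as possible indel artifacts instead of aligning them. All names and the toy group are hypothetical.

```python
from collections import Counter


def uid_consensus(seqs):
    """Majority-vote consensus for the reads of one UID group.

    Returns (consensus, set_aside), where set_aside holds reads whose
    length differs from the most common length in the group.
    Ties are broken in favor of the value seen first.
    """
    modal_length = Counter(len(s) for s in seqs).most_common(1)[0][0]
    kept = [s for s in seqs if len(s) == modal_length]
    set_aside = [s for s in seqs if len(s) != modal_length]

    # Vote column by column across the reads of modal length.
    consensus = "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*kept)
    )
    return consensus, set_aside


# Toy group: one substitution error and one read carrying an insertion.
group = ["ACGTACGT", "ACGTACGT", "ACGTTCGT", "ACGTAAAACGT"]
consensus, set_aside = uid_consensus(group)
print(consensus)
print(set_aside)
```

A length filter of this kind would discard a spurious insertion like the one in Figure 3.2, but it would also discard a genuine indel carried by a minority of reads, so a production pipeline would more plausibly align reads before voting.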
Figure 3.1. Unique molecular identifier (UID)-tagged clonal phylograms reveal technical error.
UID-tagged immunoglobulin sequencing (Ig-seq) data were computationally processed without calling UID consensus reads, and B cell clones were estimated by Cloanalyst. Maximum likelihood phylograms of two clones were colored to mark sequences sharing identical UID-tags within each clone. A) Four distinct UIDs were observed within this clone. No UID-tagged group with >1 member is error-free, though sequences within a group are more genetically similar to each other than to sequences in other groups. B) Five UIDs were observed within this clone. One large UID-group (UID 5) is error-free between sequences.
Figure 3.2. A UID-tagged clone with an insertion erroneously added during sample processing.
UID-tagged Ig-seq data were processed without calling UID consensus reads, B cell clones were estimated using Cloanalyst, and a highly mutated clone was chosen for further investigation. A) Four UIDs were identified within this clone. A maximum likelihood phylogram was colored to mark sequences with the same UIDs. B) The sequences of this clone were aligned using ClustalW and a large insertion was found in one sequence.
3.2 Standard library preparation introduces significant technical variation to biological