• No results found

Proper assignments of reads to the molecule and cell they were captured in is a critical step in scRNA-seq analysis. Under ideal circumstances, the combination of the UMI and cell describe the cell of origin. However, there are two major sources of error that are introduced during library construction: primer fracturing and barcode substitution errors. These errors confuse this association by disrupting the matching of observed barcodes with the barcodes that were present on primers during mRNA capture.

Cell barcode errors in InDrop (Figure 2.8 C) are easy to detect by design: we have a whitelist of 147,456 barcodes, each with Hamming distance >= 3. Thus, any single base substitution error is resolved by creating a lookup table for all barcodes and all single substitutions. If found in the

table, the barcode is corrected. If not, it is discarded. As estimated above, the probability of a cell barcode containing an error is ~2%, and thus the expected rate of barcodes accruing 2 errors in a barcode is 1 / 2500. A 2-error lookup table has a very large memory footprint and would significantly increase computational cost of processing each experiment. Alternative algorithms have greater complexity and would increase run time. Thus, we accept this low rate of loss and proceed to correct single base errors, recovering approximately 2% additional data for each sample.

This error rate, while high, is easy to correct and results in minimal data loss. Although a 4% barcode error rate is higher than the error rate observed by other technologies, it directly results from the use of linear amplification. If errors are independent, then the probability of obtaining 2 or more errors, which would produce an uncorrectable barcode, is 1 - cdf beginning at 2 for a Poisson distribution with λ = 0.04. Given the observed error rate, we estimated that only 0.035% of barcodes would be uncorrectable. This allows SEQC to correct errors using a fast hash-based strategy. This is in contrast to PCR-based amplification approaches which propagate errors that occur in early cycles, requiring more complex, graph-based correction methods, and larger Hamming buffers (see https://github.com/vals/umis).

In contrast to cell barcodes, UMIs are random, and correction cannot proceed by the same strategy, so we devote a section later in the pipeline to the detection of UMI errors after the gene and mapping position of a fragment are identified.

Another source of error in scRNA-seq experiments, including InDrop, cel-seq, mars-seq, and likely drop-seq and 10x genomics, is the fracturing and random-priming of capture primers (Fig- ure 2.8 A, B) (Jaitin et al., 2014). We often observe cell-barcode prefixes followed by randomers. When fragmentation occurs at the cell barcode level, we can remove the fragments using the

Figure 2.8: This figure displays a schematic that describes the types of barcoding errors that can occur in InDrop data, but also other approaches that utilize 3’ or 5’ capture by poly-A primers. These error sources are: (A) the barcode fragments within the UMI sequence; these barcodes may randomly prime, if broken before encapsulation, or produce incorrect UMIs or fragment-fragment hybridization events, if this occurs during library preparation. (B) A barcode that breaks within the cell barcode. Because InDrop has a set of valid barcodes, these errors are easily excluded, but the process that produces these errors is the same as in (A). In (C) we display barcode sub- stitution errors, which may cause barcodes to aberrantly manifest as separate cells or molecules, depending on which barcode they occur in.

whitelist approach above. To remove barcodes that break in the UMI, we determined that we would sequence 5 bases into the poly-T tail of the primer, which we expect to be all T-nucleotides. By excluding reads with more than 1 non-T nucleotide, we are able to exclude most broken UMIs.

The second source of error are barcode fractures, observed in both CEL- and SMART-seq chemistry. A barcode fracture occurs wherein some prefix of the Cell, and UMI, and in the case of InDrop, also the spacer and poly-T tail, is observed, but the remainder of the read corresponds to non-barcode information. Barcodes that break in cell barcode sequences will be excluded by cell

barcode correction, as described above. However, UMIs are not a priori known; if an aligned read breaks inside the UMI sequence during amplification, it will manifest as a new molecule despite having a proper, full-length UMI that it should be associated to. This will result in inflation of UMI counts for the matched gene.

To test for the presence of these types of errors, we used a trie data structure to efficiently count the largest hamming-corrected cell barcode prefix observed in each of our aligned sequenc- ing reads. We used the hamming-corrected barcodes because we reasoned that substitution errors would be the most common error type, and wanted to exclude those from analysis, as they are corrected through other methods, described above. The largest cell barcode prefix is most often a complete cell barcode, owing to the high quality of InDrop data. However, for 4.7% of our sequencing reads, the prefix is a partial barcode.

These partial barcodes could arise from multiple sources. One option is an insertion or dele- tion in the barcode. Errors of this type would produce a frame shift in the barcode. A second option is that the barcode has broken, and the broken end acted as a randomer, an alternative capture strategy to poly-A capture. To differentiate between these cases, we calculated, based on the list of known InDrop barcodes, which suffixes match single base insertions or deletions (indels). We then determined whether there was an existing barcode that explained each broken primer.

Cases where an indel explains the observed barcode prefix were very rare (approximately 1/4000), and most prefixes did not contain the expected poly-A tail at any offset. Thus the more likely explanation is random priming. As a result, we assume that reads missing the poly-A tail may have fractured within the UMI, and those reads are flagged for filtering. In aggregate, the filters in this section remove an average of 36% of reads (sd = 9.3%), depleting the count matrix of

Figure 2.9: A mock-up of the error correction query for a single molecule (red, top) using the Jaitin method (left) and the SEQC approach (right). Each table displays all of the molecules with a cell barcode that matches the top sequence that have a molecular barcode (1st column) within 1 edit distance of the query barcode (red) and their abundances (second column). The Jaitin approach (A) will discard any molecule when another molecule is observed to align against the same gene in the same cell. In this example, each of the bolded barcodes (0, 2, 4) would be discarded. The SEQC approach (B) Builds a probability model to estimate the expected rate that each barcode would convert into the query barcode.

spurious molecules (see Table S1 for detailed values). These values are consistent with the results of running SEQC on drop-seq or MARS-seq datasets (data not shown).