Adapted from Genome Research 2016; 10.1101/gr.209841

Section 1). We selected the child (NA12878) for our initial analysis because this individual was previously phased using parental genotype information and can therefore serve as a reference to assess the validity and precision of our approach. The Strand-seq library for a single NA12878 cell is illustrated in Figure 1B. Within this single cell, reads that aligned to the reference assembly (see Methods, Section 2) covered ~5% of the genome, and half of the genome was inherited as WC and thus suitable for phasing (Fig. 1B, red bars). Using single nucleotide variants (SNVs) listed in the HapMap reference for NA12878, we phased 77,717 variant alleles in this single cell (1.34% of reference SNVs), with 99.3% of the phased SNVs matching the reference haplotypes. This result illustrates that Strand-seq can be used to rapidly generate highly accurate chromosome-length haplotypes from single cells.
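Within a WC region, phasing a single cell amounts to grouping SNV alleles by the strand of the read that carries them. The following is a minimal Python sketch of that idea and not the code used in this study. The function name `phase_wc_region`, the input layout, and the convention that minus-strand reads are Watson and plus-strand reads are Crick are assumptions made for illustration.

```python
def phase_wc_region(observations, snv_positions):
    """Split SNV alleles seen in one WC region into Watson and Crick haplotypes.

    observations: iterable of (position, strand, base), strand is '+' or '-'.
    snv_positions: set of known SNV positions (e.g. from a reference panel).
    Assumption: '-' reads are Watson and '+' reads are Crick.
    """
    haplotypes = {"+": {}, "-": {}}
    conflicts = set()
    for pos, strand, base in observations:
        if pos not in snv_positions:
            continue  # ignore bases outside known variant sites
        seen = haplotypes[strand].setdefault(pos, base)
        if seen != base:
            # Two different bases on the same strand: treat the site as unreliable.
            conflicts.add((pos, strand))
    for pos, strand in conflicts:
        del haplotypes[strand][pos]
    return haplotypes["-"], haplotypes["+"]  # (Watson, Crick)


# Hypothetical toy data: position 200 is dropped because the plus strand disagrees with itself.
reads = [(100, "+", "A"), (100, "-", "T"), (200, "+", "G"),
         (200, "+", "C"), (300, "-", "G"), (999, "+", "A")]
watson, crick = phase_wc_region(reads, {100, 200, 300})
```

Dropping conflicting sites is one simple choice for this sketch; a real pipeline would more likely weigh base qualities or read counts.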

Building whole genome haplotypes from multiple single cell Strand-seq libraries

In order to build more complete whole genome haplotypes, Strand-seq data from multiple cells were combined (Fig. 2, i). Single cell haplotypes can easily be established by separating W and C specific alleles (Fig. 2, ii). Each single cell library samples the genome in a random fashion. By combining Strand-seq data from multiple cells, subsets of phased SNVs can be compiled into a dense consensus haplotype (Fig. 2, iii and iv). For this purpose, we developed a Strand-seq phasing algorithm and analysis pipeline, called ‘StrandPhase’ (see Chapter 2). Briefly, all WC regions are first identified within each individual cell, and SNVs present on each template strand are phased to build single cell haplotypes. Then, StrandPhase iteratively adds the phased variants from each single cell into two consensus haplotypes based on the best concordance. Accordingly, our algorithm concatenates haplotype information from multiple single cells, reinforcing and validating the phased variants in a consensus haplotype for each homologue (Fig. 2, v).
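The iterative merging step can be sketched as follows. This is a simplified Python illustration under stated assumptions, not the StrandPhase implementation: the names `agreement` and `build_consensus`, the per-position allele counters, and the use of a simple match count as the concordance measure are all hypothetical choices.

```python
from collections import Counter, defaultdict


def agreement(cell_hap, consensus):
    """Count positions where a single-cell allele equals the current majority allele."""
    score = 0
    for pos, allele in cell_hap.items():
        counts = consensus.get(pos)
        if counts and counts.most_common(1)[0][0] == allele:
            score += 1
    return score


def build_consensus(cells):
    """cells: list of (watson, crick) dicts mapping SNV position -> allele."""
    h1 = defaultdict(Counter)
    h2 = defaultdict(Counter)
    for watson, crick in cells:
        # Which strand carries which homologue is arbitrary in each cell,
        # so try both orientations and keep the more concordant one.
        direct = agreement(watson, h1) + agreement(crick, h2)
        swapped = agreement(watson, h2) + agreement(crick, h1)
        if swapped > direct:
            watson, crick = crick, watson
        for pos, allele in watson.items():
            h1[pos][allele] += 1
        for pos, allele in crick.items():
            h2[pos][allele] += 1

    def call(hap):
        # Report the most frequently observed allele at each position.
        return {pos: c.most_common(1)[0][0] for pos, c in sorted(hap.items())}

    return call(h1), call(h2)


# Hypothetical toy input: the second cell has its strands swapped relative to the first.
cells = [
    ({100: "A", 200: "G"}, {100: "T", 200: "C"}),
    ({200: "C", 300: "T"}, {200: "G", 300: "A"}),
    ({100: "A", 300: "A"}, {100: "T", 300: "T"}),
]
hap1, hap2 = build_consensus(cells)
```

Because each cell samples only part of the genome, a cell that shares no positions with the current consensus cannot be oriented; this sketch then keeps its original orientation, whereas a practical pipeline would need to defer or reorder such cells.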

Figure 2: Phasing of multiple single cell Strand-seq libraries.

(i) Multiple single cells are sorted and processed by our library preparation pipeline to prepare Strand-seq libraries. (ii) In every single cell, WC chromosomal regions are identified and homologue-specific alleles are recorded. (iii) Single cell haplotypes for a single chromosome serve as input for the phasing pipeline. (iv) Single cell haplotypes are concatenated based on the best overlap of haplotype-specific SNVs. (v) Consensus haplotypes are reported as the best consensus from all single cells.

To evaluate the performance of our analysis pipeline, we selected 183 Strand-seq libraries derived from NA12878. All Strand-seq libraries were preselected based on read depth and coverage distribution in order to avoid phasing errors introduced by low quality libraries (Fig. 3). Using StrandPhase, we built two consensus haplotypes from these data, each representing a phased parental homologue inherited by the child (NA12878), who had previously been extensively phased using parental information as part of a known HapMap family trio. In a family trio, the child can be unambiguously phased at a given variable site provided that at least one parent is homozygous at that site. Therefore, we used the reference HapMap haplotypes of the child as a gold standard to assess the validity and precision of our approach.
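The trio rule can be made concrete with a short Python sketch. It illustrates the general principle of phasing by transmission and is not the procedure used to produce the HapMap haplotypes; the function name `phase_child_site` and the genotype representation as allele pairs are assumptions for this example.

```python
def phase_child_site(child, father, mother):
    """Return (paternal_allele, maternal_allele) for one site, or None if it
    cannot be resolved. Genotypes are pairs of alleles, e.g. ("A", "G")."""
    a, b = child
    if a == b:
        return (a, b)  # homozygous child: nothing to phase
    # Keep only the parental assignments compatible with both parents' genotypes.
    options = [(p, m) for p, m in ((a, b), (b, a)) if p in father and m in mother]
    return options[0] if len(options) == 1 else None
```

For a biallelic site, exactly one assignment survives whenever at least one parent is homozygous and the genotypes are Mendelian-consistent. If both parents are heterozygous, both assignments remain and the function returns `None`.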

Figure 3: Quality criteria for single cell Strand-seq library.

Criteria for the preliminary screen of Strand-seq libraries to select only those suitable for haplotype assembly. Shown are examples of BAIT (Hills et al. 2013) ideograms of libraries categorized by quality. A) Good quality Strand-seq libraries have high (> 200) reads/Mb, an even read coverage profile, low background reads (i.e. reads mapped in the opposite direction on chromosomes expected to have unidirectional reads), and no obvious structural rearrangements such as copy number changes or aneuploidy events. B) Moderate quality Strand-seq libraries have lower (50-200) reads/Mb, a less even coverage profile, low background reads, and no structural rearrangements. C) Poor quality Strand-seq libraries have either low (< 50) reads/Mb, an uneven coverage profile, high background reads (> 5%), or obvious structural rearrangements. Poor libraries were excluded from our analysis. Within high and moderate quality libraries, chromosomes were interrogated for WC inheritance (see Chapter 2, BreakPointR). Chromosomal regions highlighted in red were picked for the haplotype assembly, since in these regions we can separate reads mapping to the plus and minus strand of the reference genome. Note that sometimes only a portion of a chromosome exhibited a WC inheritance pattern, visible as a template strand state switch from WC to CC or WW (A, green arrowheads). This occurs when a double strand break is repaired by homologous recombination during DNA replication, resulting in a sister chromatid exchange event. Such WC portions were also selected for our analysis.
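The screening thresholds above can be written as a small decision rule. The Python sketch below is illustrative only: the function name `classify_library` and its parameters are hypothetical, and it assumes that coverage evenness and structural rearrangements have already been judged (for example by inspecting the ideograms) and are passed in as boolean flags.

```python
def classify_library(reads_per_mb, background_fraction,
                     uneven_coverage, has_rearrangement):
    """Assign a Strand-seq library to 'good', 'moderate' or 'poor'.

    background_fraction: fraction of reads mapped in the unexpected direction.
    uneven_coverage, has_rearrangement: assumed to come from manual inspection.
    """
    if (reads_per_mb < 50 or background_fraction > 0.05
            or uneven_coverage or has_rearrangement):
        return "poor"      # excluded from haplotype assembly
    if reads_per_mb > 200:
        return "good"
    return "moderate"      # 50-200 reads/Mb
```

Only libraries classified as good or moderate would proceed to the search for WC regions.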

Across all 183 libraries, the aligned reads covered a total of 2,156,208 SNV positions, representing 74.6% of the variants listed in the HapMap reference (Table 1). Of all identified variants, 1,730,627 SNV alleles were assigned to consensus haplotype 1 (Child H1) and 1,729,512 SNV alleles to consensus haplotype 2 (Child H2) (Fig. 4A), yielding a median distance between all phased alleles of 622 bp (1,309 bp for heterozygous alleles). As we increased the number of cells analyzed, SNV coverage increased and the distance between subsequent SNVs decreased (Fig. 4C inset), eventually reaching saturation. Next, we compared our haplotypes to the HapMap reference and found 99.3% of our phased SNV alleles concordant with the reported haplotypes (Fig. 4C). The long-range information of Strand-seq data generated haplotypes spanning centromeres and reference assembly gaps. In addition to continuous stretches of haplotypes, we also observed smaller haplotype switches (Fig. 4C, black asterisks). These switches most likely represent homozygous inversions in these regions (Sanders et al. 2016).
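The relationship between the number of libraries, SNV coverage and SNV spacing can be explored by random subsampling. The Python sketch below is a simplified illustration for a single chromosome and is not the analysis code used here; `coverage_and_spacing`, its arguments and the fixed random seed are assumptions.

```python
import random
from statistics import median


def coverage_and_spacing(libraries, reference_positions, n, seed=0):
    """Sample n libraries and report (fraction of reference SNVs covered,
    median distance between consecutive covered SNVs).

    libraries: list of sets of SNV positions covered by each cell.
    reference_positions: set of reference SNV positions on one chromosome.
    """
    rng = random.Random(seed)           # fixed seed for reproducibility
    sampled = rng.sample(libraries, n)
    covered = sorted(set().union(*sampled) & reference_positions)
    fraction = len(covered) / len(reference_positions)
    gaps = [b - a for a, b in zip(covered, covered[1:])]
    return fraction, (median(gaps) if gaps else None)
```

Calling the function for increasing `n` and plotting both values against `n` gives a saturation curve of the kind described above. A genome-wide version would compute the gaps per chromosome before taking the median.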

Despite the accurate phasing of SNVs spanning every chromosome in the genome, we found 23,782 alleles (0.7%) that were discordant with the HapMap reference. Strikingly, 52.9% of these discordances were observed in more than one cell in our dataset, supporting the confidence of our allele phasing (Fig. 4B). Because the likelihood of random PCR or sequencing errors occurring at the same genomic position in the same homologue in multiple independent libraries is very low, we propose that discordant phasing at these SNV positions represents either errors in the HapMap reference, polymorphic inversions or somatic mutations in the HapMap cell lines.
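A tally of this kind can be sketched in a few lines of Python. The example is hypothetical and only illustrates the bookkeeping: `summarize_discordance`, the input dictionaries and the returned field names are assumptions, not part of the published pipeline.

```python
def summarize_discordance(calls, reference):
    """Compare consensus calls with a reference haplotype.

    calls: dict position -> (allele, number of supporting cells).
    reference: dict position -> reference allele for the same homologue.
    """
    shared = [pos for pos in calls if pos in reference]
    discordant = [pos for pos in shared if calls[pos][0] != reference[pos]]
    # Discordances seen in more than one independent cell are unlikely
    # to be random PCR or sequencing errors.
    multi = [pos for pos in discordant if calls[pos][1] > 1]
    return {
        "compared": len(shared),
        "discordant": len(discordant),
        "discordant_rate": len(discordant) / len(shared) if shared else 0.0,
        "multi_cell_fraction": len(multi) / len(discordant) if discordant else 0.0,
    }
```

In this sketch, `discordant_rate` corresponds to the overall mismatch rate and `multi_cell_fraction` to the share of mismatches supported by more than one cell.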

Table 1: Summary of sequencing data for each individual sequenced using Strand-seq.

Total number of sequenced libraries for the child (NA12878), father (NA12891) and mother (NA12892) of the family trio analyzed in this study. The numbers of libraries sequenced as single-end (SE) or paired-end (PE) reads are listed. Genome coverage was calculated per mappable genome (mappability file obtained from the UCSC Genome Browser database: /gbdb/hg18/bbi/wgEncodeCrgMapabilityAlign50mer.bw) and represents the percentage of genomic positions covered by sequencing reads. Depth of coverage represents the average number of bases sequenced per genomic position. Finally, the percentage of HapMap reference SNVs covered per individual is shown.

Figure 4: Accurate and dense whole-genome haplotypes are built from multiple single cell Strand-seq libraries.

A) Venn diagram summarizing the total number of SNVs found in Strand-seq data in comparison to the HapMap reference. Brown and yellow circles: haplotypes assembled from the Strand-seq data; green circle: HapMap reference SNVs used for validation. Overlaps with the green circle show the number of SNVs concordant with the HapMap reference. For example, there are 1,290,199 concordant SNV positions covered on both haplotypes, Child H1 and H2. B) All SNV positions found in our Strand-seq haplotypes are plotted by their single cell coverage, which represents the total number of independent cells that supported the variant position. SNVs covered by more than one cell are considered high confidence (black arrow). The SNVs we identified that agree with the variant listed in the HapMap reference are shown in green, and the discordant SNVs (i.e. mismatches) are shown in red. The mismatching SNV positions that are high confidence may represent errors in the HapMap reference or possible de novo mutations in our cell sample. C) Assembled haplotypes of the child derived from 183 Strand-seq libraries. Chromosome ideograms illustrate 151,700 high confidence (covered in more than one cell) heterozygous SNV positions phased from Strand-seq data and compared to the HapMap reference. The consensus haplotypes determined by Strand-seq are depicted for each chromosome, with each SNV represented by a vertical line and color-coded based on whether it matched the child’s reference homologue 1 (brown) or homologue 2 (yellow) listed in the HapMap reference. The contiguous haplotypes extend the whole length of each chromosome, spanning centromeres and reference assembly gaps (white blocks). Discordant alleles that did not match either reference haplotype are shown in red. Asterisks point to short localized switches in haplotypes that were confirmed as homozygous inversions.
Inset: the percentage of HapMap reference SNVs covered (black line) and the median distance between these SNVs (red line) are plotted for various numbers of libraries (25, 50, 100, 150), randomly sampled from the entire data set of 183 cells.

Secondary validations of Strand-seq haplotypes

To further confirm the precision of haplotype reconstruction using Strand-seq, we tested haplotyping discordances between Strand-seq and HapMap phasing using publicly available long-read PacBio RNA-seq data from the same (NA12878) individual (The International HapMap Consortium 2007). We chose PacBio RNA-seq data because sequenced transcripts hold long-range linkage information spanning multiple exons within a single PacBio read. This is because intronic regions, which are often long, are not part of sequenced transcripts (Fig. 5A). For this analysis we cross-referenced the alleles segregating together on each transcript (cDNA molecule) with both the Strand-seq and HapMap-derived haplotypes (Fig. 5B, see Methods

In document Haplotype resolved genomes:Computational challenges and applications (Page 47-52)