
CHAPTER 3: MERGING BURROWS-WHEELER TRANSFORMS

3.4 Results

3.4.2 Compression

One motivation for merging BWTs is to improve compression. As discussed in Section 2.2, the BWT was proposed as a method for data compression because it tends to create long runs of repeated symbols that many compression schemes can exploit [Burrows and Wheeler, 1994]. Long runs form because the BWT clusters similar suffixes together, and those suffixes are expected to share the same predecessor symbol (the value actually stored in the BWT). Therefore, if the two input BWTs to a merge contain similar substrings, the merged BWT should be more compressible because longer runs can form. The redundancy of genomic sequencing data results from two factors: each dataset is individually over-sampled, and the genomes of distinct organisms tend to share genomic features reflecting a common origin.
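The clustering effect can be seen on a toy input. The sketch below is a minimal illustration under stated assumptions, not the merge algorithm of this chapter: it builds the BWT of a small read collection naively by sorting every rotation of each '$'-terminated read, and the names `collection_bwt` and `count_runs` are hypothetical. Two identical reads stand in for over-sampled data.

```python
from itertools import groupby


def collection_bwt(reads):
    """Naive BWT of a read collection (illustration only).

    Each read gets a '$' terminator, every rotation of every read is
    sorted, and the last symbol of each sorted rotation is emitted.
    Ties between identical rotations are broken by read index.
    """
    rotations = []
    for index, read in enumerate(reads):
        text = read + "$"
        for shift in range(len(text)):
            rotations.append((text[shift:] + text[:shift], index))
    rotations.sort()
    return "".join(rotation[-1] for rotation, _ in rotations)


def count_runs(bwt):
    """Number of maximal runs of identical symbols, including runs of length 1."""
    return sum(1 for _ in groupby(bwt))


single = collection_bwt(["ACGT"])          # one read on its own
merged = collection_bwt(["ACGT", "ACGT"])  # the same read sampled twice

print(single, count_runs(single))  # T$ACG 5
print(merged, count_runs(merged))  # TT$$AACCGG 5
```

In this toy case, two separate BWTs would store 10 symbols in 10 runs, while the merged BWT stores the same 10 symbols in 5 runs, because identical suffixes from the two reads sort next to each other and share a predecessor symbol.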

Figure 3.1: Merge execution times. This plot shows the relationship between the total size of the input BWTs being merged and the wall-clock time to execute the merge. Each data point is a merge between two BWTs (CAST/EiJ and WSB/EiJ) where each BWT contains a randomly sampled collection of read sequences that were aligned to the mouse mitochondria. In general, the wall-clock execution time follows a linear trend with the total size of the two inputs.

We define an average run-length (RL) metric to measure the compressibility of a BWT. Average RL is defined as N/R, where N is the total number of symbols in the BWT and R is the number of contiguous symbol runs in that BWT (including runs of length 1). This metric represents the compression potential of a BWT: a larger average run length indicates a more compressible BWT. The metric emphasizes the impact of merging on compressibility rather than the behavior of a particular subsequent compression method (e.g., run-length encoding, move-to-front transforms, variable-length coding, Lempel-Ziv [Ziv and Lempel, 1978], etc.).
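As a concrete reading of this definition, the following minimal sketch computes average RL for a BWT held in memory as a Python string. The function name `average_run_length` and the in-memory string representation are assumptions made for illustration; the BWT files analyzed in this chapter are far too large to process this way.

```python
from itertools import groupby


def average_run_length(bwt):
    """Return N / R for a BWT string.

    N is the number of symbols and R is the number of maximal runs of
    identical symbols (runs of length 1 are counted).
    """
    if not bwt:
        raise ValueError("average RL is undefined for an empty BWT")
    n_symbols = len(bwt)
    n_runs = sum(1 for _ in groupby(bwt))
    return n_symbols / n_runs


print(average_run_length("AAAACCGT"))  # 8 symbols / 4 runs = 2.0
```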

To demonstrate compressibility, we used the high-coverage mitochondria data described in Section 3.4.1. Each dataset was sampled at lower coverages, merged into a single BWT, and then analyzed to identify the impact on average RL. The results of this experiment are shown in Figure 3.2. In general, average RL grows faster at lower coverages and settles into steadier, more linear growth at higher coverages.

We also performed three other merge experiments using full RNA-seq datasets from Crowley et al. [2015]. The first combined two mouse biological replicates, both WSB/EiJ inbred samples. The second merged two samples from divergent mouse subspecies: one CAST/EiJ inbred sample and one PWK/PhJ inbred sample. The final experiment merged eight biological replicates, all of type CAST/EiJ. In all three experiments, the strings were 100 base pair paired-end reads.

Each BWT file was analyzed both separately and as part of a merged BWT file, as shown in Table 3.4. In all three scenarios, compressibility improved. The first and third experiments demonstrated that merging biological replicates increases compressibility, primarily due to the increase in coverage (the genomes are expected to be the same). The second experiment demonstrated that even with divergent samples of the same species, there is still enough shared sequence to improve overall compressibility.
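The size of the improvement can be estimated from the run counts alone. The sketch below is an illustrative calculation using the rounded "RLE Entries" values reported in Table 3.4; the helper name `run_reduction` is hypothetical, and because the inputs are rounded the percentages are approximate.

```python
# (label, runs in separate files, runs in merged file), in units of 10^9,
# copied from the rounded "RLE Entries" column of Table 3.4.
experiments = [
    ("HH1361 + HH1380", 2.05, 1.83),
    ("FF0683 + GG1240", 2.48, 2.20),
    ("eight FF replicates", 13.3, 9.47),
]


def run_reduction(separate_runs, merged_runs):
    """Fraction of runs eliminated by the merge."""
    return (separate_runs - merged_runs) / separate_runs


for label, separate_runs, merged_runs in experiments:
    print(f"{label}: {run_reduction(separate_runs, merged_runs):.1%} fewer runs")
```

With these rounded inputs, the two pairwise merges remove roughly 11% of the runs each, and the eight-way merge removes roughly 29%.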

Average RL is defined as N/R, and the total number of bases, N, is the same before and after a merge. If average RL increases, then the number of runs, R, must decrease. The main reason for this is that pre-existing runs combine as datasets are merged together. To show this, the distributions of run lengths for the eight-way merge experiment are shown in Figure 3.3, both before and after the merge. In this plot, there are fewer “short” runs and more “long” runs in the merged file, indicating an increase in average run length.
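The same comparison can be sketched on toy data. The example below is a minimal illustration, assuming BWTs small enough to hold as Python strings; the function name `run_length_histogram` and the input strings are hypothetical and are not taken from the experiment.

```python
from collections import Counter
from itertools import groupby


def run_length_histogram(bwts):
    """Count how many runs of each length occur across a list of BWT strings."""
    histogram = Counter()
    for bwt in bwts:
        for _, run in groupby(bwt):
            histogram[sum(1 for _ in run)] += 1
    return histogram


# Toy inputs (hypothetical): two separate BWTs versus one merged BWT
# over the same 10 symbols.
separate = run_length_histogram(["T$ACG", "T$ACG"])
merged = run_length_histogram(["TT$$AACCGG"])

print(dict(separate))  # {1: 10}
print(dict(merged))    # {2: 5}
```

Both histograms account for the same 10 symbols, but the merged one has half as many runs, each twice as long, which is the shift from short runs to long runs described above in miniature.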

Figure 3.2: Average run length by coverage. This plot shows the average length of runs in the merged BWT at different levels of coverage. Note that as the coverage is increasing, the average run-length increases with it. This effectively means greater compressibility with respect to the original data size. Note that there is faster growth at lower coverages before it eventually settles into a more linear growth at higher coverages.

BWT(s)                 Symbols (×10^9)   RLE Entries (×10^9)   Average RL

HH1361 individual            6.68               1.13               5.902
HH1380 individual            6.32               0.926              6.825
HH1361 + HH1380             13.00               2.05               6.317
HH's Merged                 13.00               1.83               7.086

FF0683 individual            8.94               1.11               8.000
GG1240 individual           14.20               1.36              10.401
FF0683 + GG1240             23.14               2.48               9.320
FF merged with GG           23.14               2.20              10.475

FF0683 individual            8.94               1.11               8.000
FF0684 individual            7.97               1.48               5.361
FF0685 individual           13.11               1.47               8.890
FF0727 individual            7.98               1.58               5.019
FF0728 individual           13.64               1.65               8.267
FF0754 individual           18.36               2.04               8.957
FF0758 individual           13.13               1.92               6.816
FF6136 individual           10.34               2.00               5.146
FF total individuals        93.46              13.3                7.026
FF's Merged                 93.46               9.47               9.865

Table 3.4: Compressibility of merges. This table shows the average run-length (RL) metrics for RNA-seq samples before and after merging. The datasets are all inbred mouse datasets of type CAST/EiJ (FF), PWK/PhJ (GG), or WSB/EiJ (HH). Experiments are grouped into blocks. Each experiment compares the merged result (the final row of each block) to the totals for the separate files. Note that in all experiments there is a decrease in the number of run-length entries and an increase in average run length when moving from individual files to a single merged file, indicating that the merged version is more compressible than the separate files.

Figure 3.3: Run-length distribution in a merge. This plot shows the distribution of run-lengths for eight separate CAST/EiJ (FF) RNA-seq BWT files (blue) and a single merged BWT file containing all eight samples (green) from the third experiment in Table 3.4. Note that for the merged file, there are more runs of longer length and fewer runs of shorter length. This is because the merged BWT has brought the similar components of each BWT together leading to longer runs.
