Performance Evaluation: WGS data - Wang_unc_0153D

2.4 Discussion

3.2.5 Performance Evaluation: WGS data

We used both simulation and empirical data to assess the performance of AS-GENSENG in detecting CNVs and ASCNVs in WGS data. To compare performance, we applied both

Table 3.2: Methodologies comparisons among WGS methods

AS-GENSENG GENSENG CNVnator ERDS

Allele-specific CNV Y N N N

Analyzing exome sequencing data

Y N N N

Using read-pair information

N N N Y for<10kbps dele-

tions

Bias correction 1-step approach Correct for GC content, mappability, and additional noise in the data.

1-step approach. Correct for GC content, mappability, and additional noise in the data.

2-step approach. Cor- rect only for GC content.

2-step ap-

proach.Correct only for GC content.

Use of allele-specific information

Use of beta-binomial distribution to model the allelic imbalance, and combine it into emission probability of HMM.

N N Modeling the total

number of heterozygous SNPs in each window, and combin- ing it into the emission probability of HMM.

Segmentation HMM-based approach HMM-based approach Mean shift with

multiple-bandwidth partitioning

HMM-based approach

AS-GENSENG, and several state-of-the-art methods to these data, including GENSENG

(Szatkiewicz et al., 2013), CNVnator (Abyzov et al., 2011), and ERDS (Heinzen et al., 2012), using the recommended parameter setups and QC filters. For example, with CNVnator (Abyzov et al., 2011) we usedq0filter which filters out any predictions that have>50% reads with zero-valued

MAPQ (i.e. reads with multiple mapping locations). With ERDS (Heinzen et al., 2012) we removed deletions that are<10kbps and do not have supporting read-pairs. The methodological differences between AS-GENSENG, GENSENG, CNVnator, and ERDS are detailed in Table 3.2. AS-GENSENG differs from existing methods mainly in its incorporation of AS information and simultaneous bias correction.

We simulated two sets of data to assess performance. In the first simulation, we generated 100 TReC plus ASReC datasets. Using chromosome 1 from HapMap sample NA12891 as a template (which provides TReC, covariates, and ASReC), we implanted 200 CNVs (around 60% deletions and 40% duplications, median size = 3,000 bp) by modifying the original TReC. We assigned a randomly generated copy number range of 3-6 for duplications and 0-1 for deletions (all other windows were assigned copy number two). We passed the covariate matrix (columns were the assigned copy number, mappability score, and GC content respectively; rows were each sliding window) and coefficient vector (all initialized as 1) to the garsim function from R/gsarima to simulate TReC for each window. We applied the NB distribution with the log-link function for the garsim model, where the autoregressive parameter was set to 0.6 (because we used overlapping windows); the zero-correction parameter was set to “zq1” and the inverse of the overdispersion

parameter was set to 0.01. As a result, TReCs changed over the CNV windows. In the next step we simulated ASCNV. For each simulated CNV, we computed the mean ASReC of all its spanned windows in the template. If the mean ASReC was larger than a threshold (>10), we called it AS informative. We declared AS informative CNVs as ASCNV and simulated their allelic

configurations (i.e., the number of copies from allelesAandB). We then simulated ASReC (o(A)

ando(as)) based on allelic configurations. Theo(as)for each window was obtained from the template; and theo(A)for each window was computed as follows: 1) Ifo(as) ≤10or the window was not in a CNV,o(A) = 0.5∗o(as); 2) Otherwise, we definedpas the proportion from alleleA

from the allelic configuration, ando(A) ₌_p_∗_o(as)_.

In the second simulation, we mimicked the sequencing experiment by generating paired end reads from a hypothetical chromosome. To simulate reads, we first simulated one pair of

chromosomes using human chromosome 1 as the template and implanted 200 CNVs by modifying the sequence (around 60% deletions and 40% duplications, median size = 3000bps). We specified the allelic configuration for each CNV so that the modification was applied to the proper side of the chromosome (e.g. for a copy number four duplication, if the allelic configuration was ABBB, we duplicated the sequence only from the B allele two more times; if its allelic configuration was AABB, we duplicated the sequences from both alleles once). After creating the hypothetical chromosome, we applied the sequencing simulator, wgsim, as implemented in SAMTools (Li et al., 2009) to generate 100bps paired-end short reads. We used the default wgsim values. In total, 50 millions read pairs were generated and yielded on average 40X coverage. We extracted allele specific reads using NA12891 heterozygous SNPs. Finally, we used BWA (Li and Durbin, 2009) to align the reads back to unmodified reference to obtain TReC and ASReC.

As detailed in the section Data preparation, the empirical datasets used in our evaluation consisted of high-coverage WGS data from two HapMap individuals (NA12891, NA12892) sequenced as a part of the 1000GP (Mills et al., 2011; Abecasis et al., 2012; Handsaker et al., 2011).

To evaluate sensitivity and false discovery rate (FDR), we focused on autosomal CNVs and intersected the predicted CNVs of different methods with the known CNVs. The known CNVs

(Figure 2.7) are either the simulated ground-truth CNVs for the simulated data or the

1000GP-released high-confidence deletions for the empirical data. Sensitivity was calculated as the proportion of known CNVs overlapped by predicted CNVs with the correct CNV type. Following 1000GP, we defined CNV type as deletions (integer copy number 0 or 1) or duplications (integer copy number≥3). Sensitivity for detecting duplications in the two HapMap individuals was not evaluated because of the lack of high-confidence duplications (Mills et al., 2011; Abecasis et al., 2012). The FDR for the simulation data was calculated as the proportion of predicted CNVs not overlapped with the known CNVs. Because the true negatives for the two HapMap individuals are not known, we used the total number of base pairs and the total number of calls as a surrogate measure for specificity. A 50% reciprocal overlap was used as the overlapping criterion in all comparisons.

To evaluate the AS information, we reported the ASCN set. In the simulation, we had the ground truth ASCN set (ASReC>10). Thus we reported the sensitivity and FDR based on the ground truth ASCN set, and compared with the values on the entire set. With the empirical data, we reported the number of detected ASCNs.

CNV detection performance depends on sequencing coverage, especially for

read-depth-based methods. In the comparative analyses conducted in this study, both simulated (40x) and the 1000GP data (>30x) had high coverage. Thus we carried out a computational experiment in order to identify the lower bound on sequencing coverage that AS-GENSENG can handle. In this experiment, we first used Picards DownSampleSam.jar tool to down-sample the high coverage data to varying coverage of 30x, 20x, 10x, and 5x; and next applied AS-GENSENG to each resulting dataset and evaluated the performance using the same metrics.

In document Wang_unc_0153D_15329.pdf (Page 76-79)