
Gyper’s accuracy was verified using three datasets: an in-house WGS dataset, the 1000 Genomes exome dataset, and the 1000 Genomes low coverage WGS dataset (Table 4.3). The in-house dataset is not publicly available, but the 1000 Genomes samples are available on the project’s FTP site [1000Genomes, 2015].

Samples in the 1000 Genomes exome dataset have at least 20x coverage. The exome is the part of the genome formed by exons. Exons account for only about 1% of the genome, so alignment files storing exome data are much smaller than WGS alignment files at the same read coverage. Samples in the 1000 Genomes low coverage WGS dataset have at least 3x coverage.

The 1000 Genomes datasets have been verified by Erlich et al. [2011] for all the HLA class I genes: HLA-A, HLA-B, and HLA-C. They are widely used to verify HLA genotypers and are sometimes called the ‘gold standard’ for HLA genotypers. We genotyped the class I genes and compared Gyper to other HLA genotypers. The called genotypes of the 1000 Genomes samples are shown in appendices A and B.

Table 4.3: Number of individuals in each verification dataset.

Gene       deCODE WGS (sequenced)  deCODE WGS (imputed)  1000 Genomes exome  1000 Genomes WGS
HLA-A      35                      221                   180                 20
HLA-B      179                     1345                  180                 20
HLA-C      167                     1310                  180                 20
HLA-DQA1   45                      46                    0                   0
HLA-DQB1   79                      141                   0                   0
HLA-DRB1   185                     352                   0                   0
Total      690                     3415                  540                 60

The calling accuracy is the fraction of alleles that Gyper calls correctly. Gyper calls the genotype with the highest score, Smax. In some cases, however, more than one genotype has the score Smax, which makes the result ambiguous, and it is then undefined which of the tied genotypes Gyper calls. We therefore believe that the coefficient of determination, r², is a better quality measure. When calculating r² we use Gyper’s probability of a genotype, P.
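The ambiguity described above can be made concrete with a minimal sketch. It assumes, purely for illustration, that the genotype scores are available as a dictionary keyed by allele pairs; the function name `top_genotypes`, the data layout, and the example scores are hypothetical and are not taken from Gyper itself.

```python
from typing import Dict, List, Tuple

# Hypothetical representation: a genotype is a pair of allele names.
Genotype = Tuple[str, str]


def top_genotypes(scores: Dict[Genotype, float], tol: float = 0.0) -> List[Genotype]:
    """Return every genotype whose score equals the maximum score (within tol)."""
    if not scores:
        return []
    s_max = max(scores.values())
    # Sorting makes the output deterministic even when several genotypes tie.
    return sorted(g for g, s in scores.items() if s_max - s <= tol)


# Made-up scores: two genotypes share the highest score.
scores = {
    ("A*01:01", "A*02:01"): 12.5,
    ("A*01:01", "A*02:06"): 12.5,  # tied with the genotype above
    ("A*02:01", "A*03:01"): 9.0,
}
best = top_genotypes(scores)
is_ambiguous = len(best) > 1  # True when more than one genotype has the top score
```

Reporting all tied genotypes, rather than an arbitrary one, is one way to make such cases visible when accuracy is evaluated.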

Additionally, we checked how often the zygosity called by Gyper matched the experimentally determined zygosity. Both r² and the zygosity calling accuracy are calculated only at 4 digit resolution.
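As a rough illustration of the zygosity check, the sketch below truncates allele names to a given resolution and compares called and true zygosity. It assumes alleles are written in the colon-separated HLA nomenclature (for example `A*02:01:01`), so that 4 digit resolution corresponds to the first two fields. The helper names and the example genotypes are hypothetical.

```python
from typing import Sequence, Tuple

Genotype = Tuple[str, str]  # hypothetical representation: a pair of allele names


def truncate(allele: str, digits: int = 4) -> str:
    """Keep the first digits // 2 colon-separated fields: 'A*02:01:01' -> 'A*02:01'."""
    gene, _, fields = allele.partition("*")
    kept = fields.split(":")[: digits // 2]
    return f"{gene}*{':'.join(kept)}"


def is_homozygous(genotype: Genotype, digits: int = 4) -> bool:
    """A genotype is homozygous if both alleles agree at the given resolution."""
    first, second = genotype
    return truncate(first, digits) == truncate(second, digits)


def zygosity_accuracy(called: Sequence[Genotype], truth: Sequence[Genotype],
                      digits: int = 4) -> float:
    """Fraction of samples whose called zygosity matches the true zygosity."""
    if len(called) != len(truth) or not truth:
        raise ValueError("called and truth must be non-empty and of equal length")
    matches = sum(
        is_homozygous(c, digits) == is_homozygous(t, digits)
        for c, t in zip(called, truth)
    )
    return matches / len(truth)


# Made-up example: the first sample matches, the second does not.
called = [("A*02:01:01", "A*02:01:02"), ("A*01:01", "A*03:01")]
truth = [("A*02:01", "A*02:01"), ("A*01:01", "A*01:01")]
accuracy = zygosity_accuracy(called, truth)
```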

4.3 Verification

4.3.1 deCODE’s samples

deCODE has a large dataset with samples taken from the Icelandic population. Thousands of them have been sequenced using Illumina machines, aligned to the human genome, and stored in BAM files. deCODE has also genotyped a portion of them for the six most important HLA genes using laboratory genotyping methods. The class I genes HLA-B and HLA-C were genotyped with a 2 digit resolution. The other four genes, HLA-A, HLA-DQA1, HLA-DQB1, and HLA-DRB1, were typed with a 4 digit resolution.

Overall, 3600 genes have been genotyped using this method, and these were used as a verification dataset for Gyper. Genotyping individuals this way is expected to be highly accurate, but it is costly and time consuming.

For deCODE’s dataset we did two kinds of tests. In the first, Gyper genotyped the sequencing files of all individuals that have both been sequenced and are part of the verification data. Unfortunately, that is the case for only 18.85% of the individuals in the verification data. In the second, we genotyped the same 3,894 individuals and imputed data for other individuals in the dataset. An important use case of Gyper is to be able to impute its output into a large population, which can then be used in association studies. This allows us to use a much larger portion of deCODE’s verification data, 93.30%. We cannot use the entire verification data because the imputation is unable to determine an individual’s genotype if the genotypes of their relatives are unknown.

Table 4.4: Gyper’s 2 digit genotype call accuracy compared to deCODE’s verification data.

Gene       0 errors  1 error  2 errors  Correct alleles  Accuracy
HLA-A      33        2        0         68 of 70         97.1%
HLA-B      167       10       2         344 of 358       96.1%
HLA-C      157       7        3         321 of 334       96.1%
HLA-DQA1   45        0        0         90 of 90         100.0%
HLA-DQB1   77        2        0         156 of 158       98.7%
HLA-DRB1   183       2        0         368 of 370       99.5%
All genes  662       23       5         1347 of 1380     97.6%

Table 4.5: Gyper’s 4 digit genotype call accuracy compared to deCODE’s verification data.

Gene       0 errors  1 error  2 errors  Correct alleles  Accuracy  r²
HLA-A      33        2        0         68 of 70         97.1%     0.9422
HLA-DQA1   45        0        0         90 of 90         100.0%    1.0000
HLA-DQB1   70        9        0         149 of 158       94.3%     0.8977
HLA-DRB1   162       21       2         345 of 370       93.2%     0.8396
All genes  310       32       2         652 of 688       94.8%

We genotyped all individuals in our sequenced dataset using only their respective BAM files. Since each individual has two alleles, Gyper’s prediction can have 0, 1, or 2 errors for each individual when compared with the verification data. We estimate Gyper’s accuracy as the number of correctly predicted alleles divided by the total number of alleles predicted.
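The accuracy estimate follows directly from the number of individuals with 0, 1, or 2 errors: each individual contributes two alleles, and an individual with k errors contributes 2 − k correct alleles. A minimal sketch, using the HLA-B row of Table 4.4 (167, 10, and 2 individuals) as input; the function name is only illustrative.

```python
from typing import Tuple


def allele_accuracy(n0: int, n1: int, n2: int) -> Tuple[int, int, float]:
    """Return (correct alleles, total alleles, accuracy) from per-individual error counts.

    n0, n1, n2 are the numbers of individuals with 0, 1, and 2 wrongly called alleles.
    """
    total = 2 * (n0 + n1 + n2)  # two alleles per individual
    correct = 2 * n0 + n1       # 2 correct for 0 errors, 1 correct for 1 error
    return correct, total, correct / total


# HLA-B at 2 digit resolution (Table 4.4): 167, 10, and 2 individuals.
correct, total, accuracy = allele_accuracy(167, 10, 2)
print(f"{correct} of {total} ({accuracy:.1%})")  # 344 of 358 (96.1%)
```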

The overall genotype call accuracy of Gyper was 97.6% and 94.8% at 2 and 4 digit resolution, respectively (Tables 4.4 and 4.5). Zygosity was correctly called in 94.2% of cases. For the imputed data, Gyper’s accuracy was 96.8% and 96.1% at 2 and 4 digit resolution, respectively (Tables 4.6 and 4.7), and the zygosity call accuracy was 97.1%. In Tables 4.5 and 4.7 the HLA-B and HLA-C genes are excluded because only their first 2 digits are known.

Table 4.6: Gyper’s 2 digit impute accuracy compared to deCODE’s verification data.

Gene       0 errors  1 error  2 errors  Correct alleles  Accuracy
HLA-A      215       5        1         435 of 442       98.4%
HLA-B      1264      67       14        2595 of 2690     96.5%
HLA-C      1212      79       19        2503 of 2620     95.5%
HLA-DQA1   46        0        0         92 of 92         100.0%
HLA-DQB1   141       0        0         282 of 282       100.0%
HLA-DRB1   351       1        0         703 of 704       99.9%
All genes  3229      152      34        6610 of 6830     96.8%

Table 4.7: Gyper’s 4 digit impute accuracy compared to deCODE’s verification data.

Gene       0 errors  1 error  2 errors  Correct alleles  Accuracy  r²
HLA-A      215       5        1         435 of 442       98.4%     0.9652
HLA-DQA1   46        0        0         92 of 92         100.0%    1.0000
HLA-DQB1   126       12       3         264 of 282       93.6%     0.8597
HLA-DRB1   321       28       3         670 of 704       95.2%     0.9347
All genes  708       45       7         1461 of 1520     96.1%

4.3.2 1000 Genomes exome samples

In total, 180 exome BAM files were fetched from the 1000 Genomes FTP site and genotyped for the three main HLA class I genes. The samples were taken from individuals with ancestry from all over the world. Gyper’s genotype call accuracy was 99.3% and 97.9% at 2 and 4 digit resolution, respectively (Tables 4.8 and 4.9). For all genes r² > 0.95, and zygosity calling was correct in all 540 cases.


Table 4.8: Gyper’s 2 digit exome accuracy compared to Erlich et al. [2011].

Gene       0 errors  1 error  2 errors  Correct alleles  Accuracy
HLA-A      176       4        0         356 of 360       98.9%
HLA-B      176       4        0         356 of 360       98.9%
HLA-C      180       0        0         360 of 360       100.0%
All genes  532       8        0         1072 of 1080     99.3%

Table 4.9: Gyper’s 4 digit exome accuracy compared to Erlich et al. [2011].

Gene       0 errors  1 error  2 errors  Correct alleles  Accuracy  r²
HLA-A      171       9        0         351 of 360       97.5%     0.9503
HLA-B      168       12       0         348 of 360       96.7%     0.9536
HLA-C      178       2        0         358 of 360       99.4%     0.9924
All genes  517       23       0         1057 of 1080     97.9%

4.3.3 1000 Genomes WGS samples

We also verified Gyper using 20 low coverage WGS alignment files obtained from the 1000 Genomes project. These files have at least 3x non-duplicated aligned coverage. Here, the accuracy of Gyper was 96.7% and 95.0% for the 2 and 4 digit comparisons, respectively (Tables 4.10 and 4.11). Zygosity was correctly called in 98.3% of cases.

Table 4.10: Gyper’s 2 digit low coverage WGS accuracy compared to Erlich et al. [2011].

Gene       0 errors  1 error  2 errors  Correct alleles  Accuracy
HLA-A      20        0        0         40 of 40         100.0%
HLA-B      17        3        0         37 of 40         92.5%
HLA-C      19        1        0         39 of 40         97.5%
All genes  56        4        0         116 of 120       96.7%

Table 4.11: Gyper’s 4 digit low coverage WGS accuracy compared to Erlich et al. [2011].

Gene       0 errors  1 error  2 errors  Correct alleles  Accuracy  r²
HLA-A      18        2        0         38 of 40         95.0%     0.8724
HLA-B      17        3        0         37 of 40         92.5%     0.8634
HLA-C      19        1        0         39 of 40         97.5%     0.9530
All genes  54        6        0         114 of 120       95.0%

Over all datasets, Gyper correctly predicted 9,145 out of 9,410 alleles at 2 digit resolution, an accuracy of 97.2%. At 4 digit resolution it correctly predicted 3,284 out of 3,408 alleles (96.4%).
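The pooled figures can be reproduced by summing the "All genes" rows of the individual tables, so that every allele carries the same weight regardless of dataset. The sketch below does this with the counts from Tables 4.4 to 4.11; the dictionary layout and function name are illustrative only.

```python
from typing import Dict, Tuple

# (correct alleles, total alleles) from the "All genes" row of each table.
two_digit: Dict[str, Tuple[int, int]] = {
    "deCODE sequenced (Table 4.4)": (1347, 1380),
    "deCODE imputed (Table 4.6)": (6610, 6830),
    "1000 Genomes exome (Table 4.8)": (1072, 1080),
    "1000 Genomes low coverage WGS (Table 4.10)": (116, 120),
}
four_digit: Dict[str, Tuple[int, int]] = {
    "deCODE sequenced (Table 4.5)": (652, 688),
    "deCODE imputed (Table 4.7)": (1461, 1520),
    "1000 Genomes exome (Table 4.9)": (1057, 1080),
    "1000 Genomes low coverage WGS (Table 4.11)": (114, 120),
}


def pooled_accuracy(counts: Dict[str, Tuple[int, int]]) -> Tuple[int, int, float]:
    """Sum correct and total alleles over all datasets, then divide once."""
    correct = sum(c for c, _ in counts.values())
    total = sum(t for _, t in counts.values())
    return correct, total, correct / total


for label, counts in (("2 digit", two_digit), ("4 digit", four_digit)):
    correct, total, accuracy = pooled_accuracy(counts)
    print(f"{label}: {correct} of {total} alleles ({accuracy:.2%})")
```

Pooling the counts differs from averaging the per-dataset percentages, which would give the 20 low coverage WGS samples the same weight as the much larger imputed dataset.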

4 Results

4.4 Comparison with other DNA sequencing data
