
CHAPTER 3 DETECTION OF COPY NUMBER VARIATION IN SHEEP BY WHOLE

3.3 Materials and Methods

3.4.1 Mapping statistics and CNV detection

The five DNA samples from New Zealand Romney sheep were sequenced on an Illumina HiSeq 2000 machine, and each produced about 20 gigabases of high-quality sequence data. The average read depth ranged from 8.6x to 12.7x, and coverage of the reference genome ranged from 81% to 84% (Table 3.3). Using a custom-written Python script (Appendix 3.4), the genome was divided into 10 kb bins and the distribution of sequencing depth across the bins was calculated (Figure 3.2). Bins with a read depth of 6 or less accounted for 0.11% to 6.29% of the bins across the five individuals, while bins with a depth between 7 and 17 accounted for 93.34% to 99.25% of the bins (Additional file Table 3.1 sheet: Depth). A bar chart (Figure 3.2) and a violin plot (Figure 3.3) were created using Excel and R (Appendix 3.5) to show the distribution of depth in each sample. As seen in those figures, individual 828-05-5 had the highest depth.
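The binning step described above can be illustrated with a minimal sketch. This is not the script in Appendix 3.4; the function names, the toy depth values and the use of rounded bin means are assumptions made only for illustration.

```python
from collections import Counter


def mean_depth_per_bin(depths, bin_size=10_000):
    """Average a list of per-base depths over consecutive fixed-size bins."""
    means = []
    for start in range(0, len(depths), bin_size):
        window = depths[start:start + bin_size]
        means.append(sum(window) / len(window))
    return means


def depth_histogram(bin_means):
    """Count bins by their rounded mean depth (assumed binning rule)."""
    return Counter(round(m) for m in bin_means)


def fraction_in_range(bin_means, low, high):
    """Share of bins whose rounded mean depth lies in [low, high]."""
    hits = sum(1 for m in bin_means if low <= round(m) <= high)
    return hits / len(bin_means)


# Toy data (hypothetical): four bins of four bases each.
toy_depths = [5] * 4 + [10] * 4 + [12] * 4 + [20] * 4
bin_means = mean_depth_per_bin(toy_depths, bin_size=4)
low_share = fraction_in_range(bin_means, 0, 6)    # bins with depth <= 6
mid_share = fraction_in_range(bin_means, 7, 17)   # bins with depth 7 to 17
```

On real data the per-base depths would come from the alignment files, and `bin_size` would be set to the bin width used in the analysis.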

The average number of CNV segments detected, after quality control, was 662 per sample (Table 3.3). After merging the CNV segments, 1,836 CNVRs were obtained. Of these, 1,653 were losses, 181 were gains and 2 were mixed (Figure 3.4). The size of the CNVRs ranged from 999 to 73,499 bp, with a mean and median of 3,835 and 1,999 bp, respectively. Figure 3.5, prepared using R (Appendix 3.3), depicts the relationship between sequencing depth and the number of CNVs detected in the five individuals, in 50 kb bins across the chromosomal region ch13:46100000-5110000. It revealed that most losses were confined to low-depth areas.
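Merging CNV segments into CNVRs can be sketched as follows. The merging rule used in this study is not restated in this section, so the sketch assumes that any two segments on the same chromosome that overlap by at least one base are joined, and that a region containing both losses and gains is labelled "mixed". The function name and the toy segments are hypothetical.

```python
def merge_into_cnvrs(segments):
    """Merge overlapping CNV segments into CNV regions (CNVRs).

    Each segment is (chrom, start, end, kind), with kind "loss" or "gain".
    Assumed rule: segments on the same chromosome that share at least one
    base are merged; a region holding both kinds is labelled "mixed".
    """
    cnvrs = []
    for chrom, start, end, kind in sorted(segments):
        last = cnvrs[-1] if cnvrs else None
        if last and last["chrom"] == chrom and start <= last["end"]:
            last["end"] = max(last["end"], end)
            last["kinds"].add(kind)
        else:
            cnvrs.append({"chrom": chrom, "start": start,
                          "end": end, "kinds": {kind}})
    for region in cnvrs:
        kinds = region.pop("kinds")
        region["type"] = kinds.pop() if len(kinds) == 1 else "mixed"
    return cnvrs


# Toy segments (hypothetical coordinates).
toy_segments = [
    ("chr1", 1000, 2999, "loss"),
    ("chr1", 2500, 3999, "loss"),
    ("chr1", 9000, 9999, "gain"),
    ("chr2", 500, 1499, "loss"),
    ("chr2", 1000, 1999, "gain"),
]
toy_cnvrs = merge_into_cnvrs(toy_segments)
```

In this toy case the five segments collapse into three regions: one loss, one gain and one mixed.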

Figure 3.2 Distribution of sequencing depth-size (50 kb) bins.

The Y axis represents bins of different sequencing depth, while the X axis represents the frequency of the corresponding depth bins. The five colours represent the five individuals.

Figure 3.3 Violin plots of sequencing depth (at whole genome level) in five individuals

The X and Y axes represent the samples and the log10 (average depth of each bin), respectively.

[Figure 3.2 chart residue. Axis labels: "Frequency of different depth-size bins" and "Bins with different depth size" (depth classes 5 to 25+; frequency scale 0 to 120,000). Legend: 828-05-1, 828-05-2, 828-05-3, 828-05-4, 828-05-5.]


Comparison of CNVs between the five individuals revealed that only 75 CNVs (4%) were common to all five animals, and 57% (1,046 of 1,838) of the CNVs were unique to a single animal (Figure 3.6). The majority of the detected CNVRs were less than 5 kb in size (Figure 3.7).
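The counts of shared and unique CNVs can be derived as in the sketch below. It assumes each individual's CNVs have already been assigned to region identifiers; the identifiers, sample labels and function name are hypothetical, and the overlap criterion used in the actual comparison is not restated here.

```python
from collections import Counter


def sharing_summary(regions_by_sample):
    """Count regions carried by every sample and by exactly one sample."""
    carriers = Counter()
    for regions in regions_by_sample.values():
        carriers.update(set(regions))  # one vote per sample per region
    n_samples = len(regions_by_sample)
    return {
        "total": len(carriers),
        "common": sum(1 for n in carriers.values() if n == n_samples),
        "unique": sum(1 for n in carriers.values() if n == 1),
    }


# Toy input (hypothetical region identifiers for three samples).
toy_regions = {
    "A": {"r1", "r2", "r3"},
    "B": {"r1", "r3", "r4"},
    "C": {"r1", "r5"},
}
toy_summary = sharing_summary(toy_regions)

# Percentages from the counts reported in the text above.
pct_unique = round(100 * 1046 / 1838)
pct_common = round(100 * 75 / 1838)
```

With the reported counts, 1,046 of 1,838 corresponds to about 57% and 75 of 1,838 to about 4%.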

Table 3.3 Summary of the copy number variants (CNVs) detected in five Romney sheep

| Sample ID | Depth | Number of CNVs detected | After quality control: deletions | After quality control: duplications | After quality control: total | Mean size (bp) | Median size (bp) |
|---|---|---|---|---|---|---|---|
| 828-05-1 | 8.6x | 16,665 | 425 | 76 | 501 | 4,756.4 | 2,500 |
| 828-05-2 | 10.1x | 18,080 | 514 | 82 | 596 | 4,096.4 | 2,000 |
| 828-05-3 | 8.9x | 17,059 | 481 | 68 | 549 | 4,359.7 | 2,500 |
| 828-05-4 | 10.4x | 18,392 | 587 | 89 | 676 | 4,222.6 | 2,500 |
| 828-05-5 | 12.7x | 20,553 | 903 | 85 | 988 | 3,427.1 | 2,000 |
| Average/sample | 10.14x | 18,149.8 | 582 | 80 | 662 | 3,835 | 1,999 |

3.4.2 qPCR validation

Two randomly selected CNVs detected in one individual were validated by qPCR. The copy numbers observed by qPCR for the two tested CNVs matched the predicted copy numbers in both cases (100%) (Additional file: S3.2 qPCRresult).


Figure 3.4 Chromosomal distribution of copy number variant regions (CNVR) detected in five Romney sheep, using whole genome sequencing data.


Figure 3.5 Plot showing relationship between sequencing depth and number of CNVs detected in the five individuals, in 50 kb bins across the chromosomal region, ch13:46100000-5110000.

The X and Y axes in each graph represent chromosomal position and sequencing depth, respectively. The black points represent the CNVs (losses). The plots were based on the average depth in 50 kb bins and were created using an R script (Appendix 3.3). The majority of the losses were detected in low-depth zones.

3.4.3 Gene annotation

In total, 587 Ensembl genes were located in the detected CNVRs (additional file: Table S3.1), and NCBI gene IDs could be identified for 501 of them. GO and pathway analysis of the NCBI genes revealed that 19 GO BP (biological process), 14 GO CC (cellular component) and 10 GO MF (molecular function) categories and 4 KEGG pathways were over-represented (P<0.05) in the identified CNVRs (additional file, Table S1). However, none of the over-represented GO categories or pathways passed multiple testing correction (Bonferroni corrected P<0.05).
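The effect of the Bonferroni correction can be shown with a short sketch. The p-values below are invented for illustration and are not results from this study; the number of tests `m` should be the total number of categories tested, which is not stated in this section.

```python
def bonferroni(p_values, alpha=0.05):
    """Return (raw p, Bonferroni-adjusted p, passes) for each test.

    The adjusted p-value is the raw p-value multiplied by the number of
    tests, capped at 1.0.
    """
    m = len(p_values)
    results = []
    for p in p_values:
        adjusted = min(1.0, p * m)
        results.append((p, adjusted, adjusted < alpha))
    return results


# Hypothetical p-values for four categories (not from the thesis).
toy_p = [0.004, 0.02, 0.04, 0.3]
toy_results = bonferroni(toy_p)
n_nominal = sum(1 for p in toy_p if p < 0.05)
n_corrected = sum(1 for _, _, ok in toy_results if ok)
```

In this toy case three categories are nominally significant, but only one survives correction. The more categories that are tested, the smaller a raw p-value must be to pass.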

Figure 3.6 CNV comparison between five Romney sheep.

Individual sheep are shown in different coloured ovals. Numbers in overlapping regions denote the number of CNVs common to the respective individuals, while those in non-overlapping regions are unique to each individual. There is a slight discrepancy in the number of CNVs per individual (compared with those in Table 3.3), as one large CNV detected in one individual could have been detected as several small CNVs in another individual during the analysis.

3.4.4 Pedigree comparison

In the offspring 828-05-01, 355 CNVs (71% of the total in the individual) could be traced to its parents (Figure 3.8), while in the other progeny, 828-05-03, 360 CNVs (65.9% of the total in the individual) could be traced to its parents (Figure 3.9). Further, of the CNVs inherited exclusively from the sire by the two progeny (106 by 828-05-01 and 133 by 828-05-03), 26 overlapped (Figure 3.10).
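Tracing an offspring's CNVs to its parents can be sketched as an interval-overlap test. The sketch assumes that an offspring CNV counts as traced when it shares at least one base with a CNV in a parent; the criterion actually applied is not restated in this section, and the coordinates and function names are hypothetical.

```python
def overlaps(a, b):
    """True if two (chrom, start, end) intervals share at least one base."""
    return a[0] == b[0] and a[1] <= b[2] and b[1] <= a[2]


def trace_to_parents(offspring, sire, dam):
    """Classify each offspring CNV by the parent(s) carrying an overlapping CNV."""
    counts = {"sire_only": 0, "dam_only": 0, "both": 0, "untraced": 0}
    for cnv in offspring:
        in_sire = any(overlaps(cnv, p) for p in sire)
        in_dam = any(overlaps(cnv, p) for p in dam)
        if in_sire and in_dam:
            counts["both"] += 1
        elif in_sire:
            counts["sire_only"] += 1
        elif in_dam:
            counts["dam_only"] += 1
        else:
            counts["untraced"] += 1
    return counts


# Toy CNV lists (hypothetical coordinates).
toy_offspring = [("chr1", 100, 200), ("chr1", 500, 600),
                 ("chr2", 50, 80), ("chr3", 10, 20)]
toy_sire = [("chr1", 150, 250), ("chr2", 60, 90)]
toy_dam = [("chr1", 180, 220), ("chr1", 550, 700)]
toy_counts = trace_to_parents(toy_offspring, toy_sire, toy_dam)
toy_traced = sum(toy_counts.values()) - toy_counts["untraced"]
```

The "sire_only" class corresponds to the CNVs inherited exclusively from the sire that are compared between the two half-sibs.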

3.5 Discussion

3.5.1 Mapping statistics and CNV detection

Figures 3.2 and 3.3 show that the predominant read depth in the five samples was about 9X. As expected, the number of CNVs detected in the samples increased with sequencing depth. The individual with the highest depth, 828-05-5, had the most CNVs, while the one with the lowest depth, 828-05-1, had the fewest, showing that depth of coverage is a key factor in CNV detection from NGS data. Before quality control, about 18,000 CNVs were detected in each sample. However, because of the many gaps (about 120,000) in the Oar_v3.1 assembly, quality control reduced the number of CNVs dramatically, to between 501 and 988 per sample (Table 3.3). These gaps are zones of the reference genome that include highly repetitive sequences. The sequencing reads from those regions could not be mapped to the reference genome. Normally, CNVs detected around gaps are considered unreliable. A cattle study of 20 animals (Dolezal et al. 2014) using the same software, CNVnator, identified 29,975 deletions, 1,489 duplications and 365 complex CNVRs, far more than were detected in this study. However, Dolezal et al. (2014) sequenced to a depth of 20X, and only 63,000 gaps were reported in the bovine genome assembly. Hence, a more complete genome assembly and a higher sequencing depth might be necessary for reliable CNV detection.
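Removing CNV calls that fall on or near assembly gaps can be sketched as below. The quality control criteria used in this study are described elsewhere; this sketch assumes a simple rule in which a call is dropped if it overlaps a gap or lies within a chosen margin of one. The `margin` parameter, function names and coordinates are hypothetical.

```python
def near_gap(call, gaps, margin=0):
    """True if a (chrom, start, end) call overlaps a gap or lies within
    `margin` bases of one (assumed filtering rule)."""
    chrom, start, end = call
    return any(
        g_chrom == chrom
        and start <= g_end + margin
        and g_start - margin <= end
        for g_chrom, g_start, g_end in gaps
    )


def filter_calls(calls, gaps, margin=0):
    """Keep only the calls that are not on or near an assembly gap."""
    return [c for c in calls if not near_gap(c, gaps, margin)]


# Toy data (hypothetical): one gap and four calls.
toy_gaps = [("chr1", 5000, 5999)]
toy_calls = [
    ("chr1", 1000, 1999),   # far from the gap
    ("chr1", 5500, 6499),   # overlaps the gap
    ("chr1", 6200, 7000),   # within 500 bp of the gap
    ("chr2", 5500, 6000),   # different chromosome
]
kept_strict = filter_calls(toy_calls, toy_gaps, margin=500)
kept_overlap_only = filter_calls(toy_calls, toy_gaps, margin=0)
```

A wider margin removes more calls, which shows how a gap-rich assembly can sharply reduce the number of CNVs that pass quality control.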


Figure 3.7 Frequency distribution of the size range of copy number variant regions (CNVR) detected in five Romney sheep, using NGS.

Frequencies of the detected CNVR in different size ranges are shown.

[Figure 3.7 chart residue. Axis labels: "Frequency of CNVR" (scale 0 to 800) and "Size of CNVRs". Bar values in plotted order: 292, 680, 326, 163, 67, 46, 37, 37, 28, 25, 20, 15, 14, 8, 9, 9, 4, 8, 6, 6, 5, 3, 4, 3, 1, 4, 2, 1, 13.]


Figure 3.8 Inheritance of CNV in individual 828-05-1.

Pink, purple and blue circles represent CNVs detected in animals 828-05-5, 828-05-4 and 828-05-1, respectively. Numbers in overlapping regions denote the number of CNVs common to the respective individuals, while those in non-overlapping regions are unique to each individual. There is a slight discrepancy in the number of CNVs per individual (compared with those in Table 3.3), as some CNVs were merged during the analysis because one large CNV in one individual could be divided into several small CNVs in another individual.


Figure 3.9 Inheritance of CNV in individual 828-05-3.

Orange, green and blue circles represent CNVs detected in animals 828-05-5, 828-05-2 and 828-05-3, respectively. Numbers in overlapping regions denote the number of CNVs common to the respective individuals, while those in non-overlapping regions are unique to each individual. There is a slight discrepancy in the number of CNVs per individual (compared with those in Table 3.3), as one large CNV detected in one individual could have been detected as several small CNVs in another individual during the analysis.


Figure 3.10 Comparison of CNVs inherited by the two half-sibs, exclusively from their sire.

Orange and green circles represent CNVs detected in animals 828-05-1 and 828-05-3, respectively. Numbers in overlapping regions denote the number of CNVs common to both individuals while those in non-overlapping regions are unique for each individual.

Comparison of CNVs between the five sheep revealed that only 75 CNVs (4%) were common to all five animals and 57% (1,046 of 1,838) of the CNVs were unique to a single animal (Figure 3.6). This could reflect large differences between individuals, or low coverage of the NGS data, which might cause CNVs to be missed during detection. In addition, Figure 3.5 reveals that most of the CNV losses were detected in low-depth regions of the chromosome, which suggests that sequencing depth has a strong influence on CNV detection.


Also, the majority of the CNVRs detected in this study were less than 5 kb in size (Figure 3.7). Comparison of the CNVRs from this study with those from previous studies in sheep revealed that NGS-based CNV detection provides better resolution (a higher number of CNVRs, each smaller in size) than microarray or aCGH-based detection (Table 3.4).

3.5.2 qPCR validation

Leftover DNA (after NGS) was available for only one sheep, and the study individuals were no longer alive. Hence, CNV validation, using two pairs of PCR primers, was undertaken on only one animal, and the qPCR results corroborated the predicted copy numbers of those two CNVs in that animal. However, such a small validation sample limits the confidence that can be placed in this result.

3.5.3 Gene annotation

Gene Ontology (GO) and KEGG analysis showed that genes in the detected CNVRs were over-represented in categories associated with brain morphogenesis, the cytoskeleton, cell junctions and calcium ion binding. However, none of these categories passed the Bonferroni threshold for multiple testing, which suggests that the association between these genes and CNVs remains unclear.
