Chapter 6 Core study: using the gene panel to analyse a large cohort of ALS and control subjects
6.1. Overview
We performed a pilot study using NGS as a tool for examining the genetics of ALS in 95 patients (Chapter 3). This method revealed a number of interesting results and provided a basis for the principal investigation, in which we aimed to apply this technology to more than one thousand patients.
In terms of the loci included in the gene panel, a few contained regions of DNA that failed to sequence. With this in mind, BSCL2, CEP112 and VEGF were removed from the project, as their connection to ALS remains weak and we could not obtain high-quality data for them. Additionally, the amplicons covering FUS, OPTN and SETX were redesigned to improve coverage. Lastly, although SPG11 has a potential involvement in ALS, its sheer size made it challenging both to sequence and to analyse, especially since compound heterozygous cases have been observed and the chance of two rare variants occurring together in so large a gene is much higher. For this reason we decided not to take this gene through to the next part of our experiment. However, since I had WES data on the controls, I also examined the removed genes within this control cohort to compare with the test plate data. Since the completion of the test plate, two other genes, TREM2 and PFN1, had gained attention as risk factors for ALS, and these were therefore added to the ALS panel.
6.1.1. METHODS
A requirement for the final design of the project was to balance the available funds against both the number of patients and the amount of DNA targeted. The resulting design covered all exons and both untranslated regions (UTRs) of SOD1, TARDBP, FUS, VCP, OPTN and UBQLN2. The remaining genes were covered at mutation hotspots: ALS2, ANG, CHMP2B, DAO, DCTN1, FIG4, NEFH, PFN1, PON1, PON2, PON3, PRPH, SETX, SQSTM1, TREM2 and VAPB. These hotspots were ascertained by use of an ALS variant database I as per the descriptions in Chapter 2. I performed an identical analysis on whole-exome sequencing (WES) data from 510 control subjects, extracting the same genomic regions that were covered by the MiSeq panel. I included four common sex markers to ensure that subjects matched their stated gender.
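Restricting the WES calls to the regions covered by the panel amounts to an interval lookup. The following is a minimal sketch of that idea and not the pipeline actually used; the function names, the 1-based inclusive coordinate convention and the example coordinates are all assumptions made for illustration.

```python
from bisect import bisect_right
from collections import defaultdict


def build_index(regions):
    """Group (chrom, start, end) intervals by chromosome, sort them and merge overlaps.

    Coordinates are assumed to be 1-based and inclusive (an assumption, not
    something stated in the text).
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in regions:
        by_chrom[chrom].append((start, end))

    index = {}
    for chrom, intervals in by_chrom.items():
        merged = []
        for start, end in sorted(intervals):
            if merged and start <= merged[-1][1]:
                # Overlaps the previous interval: extend it.
                merged[-1] = (merged[-1][0], max(merged[-1][1], end))
            else:
                merged.append((start, end))
        index[chrom] = merged
    return index


def in_panel(index, chrom, pos):
    """Return True if position `pos` on `chrom` lies inside any panel interval."""
    intervals = index.get(chrom, [])
    # Locate the last interval whose start is <= pos.
    i = bisect_right(intervals, (pos, float("inf"))) - 1
    return i >= 0 and intervals[i][0] <= pos <= intervals[i][1]


# Hypothetical panel regions, for illustration only (not real amplicon coordinates).
panel = build_index([("chr21", 100, 200), ("chr21", 150, 300), ("chr1", 50, 60)])

# Hypothetical WES calls as (chrom, pos) pairs; keep only those inside the panel.
wes_calls = [("chr21", 250), ("chr21", 301), ("chr1", 50), ("chr2", 55)]
wes_in_panel = [call for call in wes_calls if in_panel(panel, *call)]
```

Merging overlapping intervals first keeps the lookup to a single binary search per variant, which matters little for a small panel but keeps the logic simple.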
As mentioned, NGS is unable to reliably assay long repeats such as those in C9orf72 and ATXN2. Therefore, repeat-primed PCR was used to detect the expansion mutation in C9orf72. Our collaborators at King's completed standard fragment length analysis for the microsatellite repeat in ATXN2. However, because our control samples did not have ATXN2 data, this gene was not included in our case-control analysis.
6.1.2. RESULTS
A total of 1,131 subjects were run on the MiSeq, comprising 100 controls, 124 fALS and 917 sALS. The majority of these (n = 1,074) were from the MNDA DNA bank, while 33 of the controls were from IPDGC and the remainder were Argentinian samples (18 sALS and 6 fALS), which are discussed separately in Chapter 7 since their ethnicity differed from the rest of the cohort. WES data were provided for 510 controls. Chapter 2.1 contains a more detailed description of all samples. In total, 1.6% of samples failed to sequence completely: five controls (all from the WES), eleven sporadic and two familial subjects. A further three sporadic patients and five controls failed to sequence adequately for some of the targeted genomic regions, although this varied slightly depending on the locus or gene region under examination. Therefore some genes, when examined independently, had slightly higher subject numbers.
Following initial standard quality checks to remove false positive calls, 52,804 variants remained. As per the method described in Section 2.5.2, a total of 29,930 images were taken of flagged variants, which were then examined by eye. Of these, 8,654 were kept that would normally have been discarded by automated filtering, while 4,621 that had passed quality checks were clearly false positives (8.8% of all calls). The final dataset averaged 41 alterations per person (range 24-72), most of which were common polymorphisms, including intronic and synonymous SNPs. Some minor regions were covered adequately by only one of the technologies used (either WES or MiSeq); variation within these regions was not included in most analyses, except when examining an individual variant against the published literature. A total of 317 patients did not have complete C9orf72 data because of insufficient DNA. Of those typed for repeat expansions in this gene, 45 of 654 (6.9%) sporadic patients had the mutation, as did 11 of 72 (15.2%) familial cases.
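The percentages quoted above follow directly from the stated counts, and with a denominator as small as 72 it can help to see how wide the uncertainty is. The sketch below recomputes the proportions and adds a Wilson score interval; the interval is an illustrative addition and was not part of the analysis reported here, and the function names are assumptions.

```python
from math import sqrt


def percent(k, n):
    """Percentage of n observations that are positive."""
    return 100.0 * k / n


def wilson_interval(k, n, z=1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 gives roughly 95%)."""
    p = k / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half


# Counts taken from the text above.
false_positive_pct = percent(4621, 52804)   # false positives among all calls
sporadic_c9_pct = percent(45, 654)          # C9orf72 expansions, sporadic
familial_c9_pct = percent(11, 72)           # C9orf72 expansions, familial

sporadic_ci = wilson_interval(45, 654)
familial_ci = wilson_interval(11, 72)
```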
Comparing the 33 controls that had been examined using both whole-exome and MiSeq-targeted sequencing revealed, after filtering, identical calls in all subjects except for a few intronic variants (which WES does not capture) and for indels of two or more nucleotides. The differences in indel calling lay mostly in mononucleotide repeats, which are known to cause problems in NGS. All indels of two or more nucleotides were removed from the analysis across patients and controls to ensure that no technological bias was driving differences between the two groups. For SNPs, WES and targeted sequencing produced the same results, and therefore the WES data can be reliably used as controls for my dataset.
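The indel filter and the platform comparison described above can be sketched as follows. This is an illustration under stated assumptions rather than the code used in the study: variants are represented as hypothetical (chrom, pos, ref, alt) tuples in VCF-style notation, and the example calls are invented.

```python
def indel_length(ref, alt):
    """Number of nucleotides inserted or deleted (0 for a substitution)."""
    return abs(len(ref) - len(alt))


def keep_variant(ref, alt, max_indel=1):
    """Keep substitutions and single-nucleotide indels; drop indels of two or more nucleotides."""
    return indel_length(ref, alt) <= max_indel


def concordance(calls_a, calls_b):
    """Split two sets of calls for the same subject into shared and platform-specific calls."""
    return {
        "shared": calls_a & calls_b,
        "only_a": calls_a - calls_b,
        "only_b": calls_b - calls_a,
    }


# Hypothetical calls for one control typed on both platforms.
wes = {("chr1", 100, "A", "G"), ("chr1", 200, "CTT", "C"), ("chr2", 50, "G", "GA")}
miseq = {("chr1", 100, "A", "G"), ("chr2", 50, "G", "GA")}

# Apply the same indel filter to both call sets before comparing them.
wes_filtered = {v for v in wes if keep_variant(v[2], v[3])}
miseq_filtered = {v for v in miseq if keep_variant(v[2], v[3])}
result = concordance(wes_filtered, miseq_filtered)
```

Applying the filter identically to both call sets before comparison is what keeps a platform-specific indel artefact from appearing as a case-control difference.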
One of the major difficulties with NGS data is how to interpret variants, especially those that are novel or extremely rare. We found 906 such alterations, of which 225 are exonic and 138 have previously been published in relation to ALS or another disease; however, some of these also occur in our control cohort. Variants were deemed likely to be causal if they had been published previously and were not found in control cohorts. Under this interpretation of pathogenicity, 103 of 1,007 patients can be explained (10.2%; Table 15), mostly by C9orf72 repeat expansions (4.9%) but also by SOD1 (2%), TARDBP and FUS (both 1%). However, as mentioned, the true C9orf72 frequency is potentially higher because of missing data in a number of patients.
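The working rule stated above (previously published and absent from controls) can be written as a simple predicate. The sketch below assumes a hypothetical `Variant` record and invented placeholder variants; it encodes only the two-part rule from this paragraph, not the fuller criteria referred to in the caption of Table 15.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    gene: str
    label: str             # placeholder identifier, not a real variant name
    published: bool        # previously reported in ALS or another disease
    control_carriers: int  # number of control subjects carrying the variant


def likely_causal(variant):
    """Rule from the text: published previously and not seen in any control."""
    return variant.published and variant.control_carriers == 0


def explained_patients(patient_variants):
    """Return the IDs of patients carrying at least one likely causal variant."""
    return {
        patient_id
        for patient_id, variants in patient_variants.items()
        if any(likely_causal(v) for v in variants)
    }


# Invented example data for illustration only.
v_a = Variant("SOD1", "variant_A", published=True, control_carriers=0)
v_b = Variant("NEFH", "variant_B", published=True, control_carriers=3)
v_c = Variant("FUS", "variant_C", published=False, control_carriers=0)

patients = {"p1": [v_a, v_b], "p2": [v_b], "p3": [v_c], "p4": []}
explained = explained_patients(patients)
fraction_explained = len(explained) / len(patients)
```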
SPG11, although not examined in patients, was examined independently in the control dataset. Subjects had an average of 1.7 mutations in this recessively causal gene, with 24% of individuals harbouring a homozygous variant and 42% a potential compound heterozygous variant. A total of 1% of controls had 5 mutations in SPG11, showing that it is a highly mutated gene. Additionally, one control had a stopgain mutation in this gene.
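Per-subject summaries of this kind can be derived from genotype counts. The sketch below is an assumption about how such a tally might be made, not the method used here: a subject is flagged as a potential compound heterozygote when two or more different heterozygous variants are present in the gene, and the word "potential" matters because short-read data alone do not show whether the variants lie on different chromosomes. All names and the example cohort are hypothetical.

```python
def summarise_subject(genotypes):
    """Summarise one subject's variants in a single gene.

    `genotypes` maps a variant ID to its alternate allele count
    (1 = heterozygous, 2 = homozygous).
    """
    carried = {variant: n for variant, n in genotypes.items() if n > 0}
    heterozygous = [variant for variant, n in carried.items() if n == 1]
    return {
        "n_variants": len(carried),
        "homozygous": any(n == 2 for n in carried.values()),
        # Phase is unknown, so this is only a potential compound heterozygote.
        "potential_compound_het": len(heterozygous) >= 2,
    }


def cohort_summary(cohort):
    """Mean variants per subject and percentage of subjects in each category."""
    rows = [summarise_subject(genotypes) for genotypes in cohort.values()]
    n = len(rows)
    return {
        "mean_variants": sum(r["n_variants"] for r in rows) / n,
        "pct_homozygous": 100 * sum(r["homozygous"] for r in rows) / n,
        "pct_potential_compound_het": 100 * sum(r["potential_compound_het"] for r in rows) / n,
    }


# Invented genotypes for four hypothetical controls.
controls = {
    "c1": {"v1": 2},
    "c2": {"v1": 1, "v2": 1},
    "c3": {"v3": 1},
    "c4": {},
}
summary = cohort_summary(controls)
```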
          Familial             Sporadic             Controls
Gene      All     Likely       All     Likely       All     Likely
                  pathogenic           pathogenic           pathogenic
ALS2      3.7%    0.9%         6.6%    0%           8%      0%
ANG       0%      0%           0.4%    0%           1.3%    0.3%
CHMP2B    2.8%    0%           2.9%    0%           11%     0%
DAO       0.9%    0.9%         0.2%    0.1%         0%      0%
DCTN1     0%      0%           0.9%    0%           1.5%    0%
FIG4      0.9%    0%           0.6%    0.1%         7.2%    0%
FUS       2.8%    1.9%         2.4%    0.6%         12%     0.2%
NEFH      27%     0%           18.6%   0%           38%     0%
PFN1      4.6%    0%           7%      0.2%         6.3%    1.2%
PRPH      2.8%    0%           1.4%    0%           2%      0%
SETX      3.7%    0%           4.2%    0%           6%      0%
SOD1      8.3%    7.3%         1.1%    0.8%         0.3%    0%
SQSTM1    1.9%    0%           1.6%    0%           1.7%    0%
TARDBP    6.5%    4.6%         1.8%    0.8%         1%      0%
TREM2     0.9%    0.9%         0.8%    0.7%         0.7%    0.7%
UBQLN2    1.9%    0%           2.6%    0.1%         1.7%    0%
VAPB      0%      0%           1.6%    0%           1.2%    0%
VCP       1.9%    0%           1.6%    0.1%         7.7%    0%

Table 15. Percentage of patients with coding mutations in each gene. Variants were considered to be likely pathogenic if they fulfilled several criteria from Table 6. These results do not take into account missing data and so actual numbers may be slightly higher.