USING SEQUENCING TECHNOLOGIES (AND OTHER LARGE-

TOXOPLASMA GONDII (Adapted From Paper)

New sequencing technologies have made many species accessible to genomic-scale analysis, raising the challenge of how to integrate such information from various sources, discriminate biological significance, and make these results accessible to diverse end-user communities. We have exploited strand-specific RNA-seq analysis to profile the transcriptome of human host cells infected with the protozoan parasite Toxoplasma gondii, a prominent eukaryotic microbial pathogen responsible for disease during congenital infection and in immunosuppressed individuals (Tenter, Heckeroth, and Weiss 2000).

At ~65 Mb in length, the T. gondii genome is relatively compact (Lorenzi et al. 2016), but harbors most of the complexity described for other eukaryotic genomes, including ~8300 protein coding genes, ranging in length from <1 to >60 kb (average ~4.8 kb), and fragmented by introns (range 0-60+; average ~5.8) that follow consensus eukaryotic sequence constraints (primary T. gondii transcripts are properly spliced by human nuclear extracts). Extensive population genetic and functional genomic datasets are available for T. gondii, including additional RNA-seq data and other transcriptional profiles for various strains and developmental stages, proteomics data, chromatin marks, etc. (ToxoDB.org; Gajria et al. 2008). As noted above (Chapter 1), numerous tools are also available for experimental manipulation of T. gondii in the laboratory (Roos et al. 1995; Sibley et al. 2002; Kim and Weiss 2004; Meissner et al. 2007; Sidik et al. 2016). We have generated strand-specific RNA-seq datasets for various T. gondii tachyzoite strains, at multiple time points during their 48 hr intracellular in vitro replicative cycle, and analyzed these in parallel with datasets from other life cycle stages, to assess the accuracy of current genome annotation, identify new genes (including alternatively spliced transcripts), and assess stage-specific transcript expression and regulation (Chapter 3).

Methods

Parasite cultures, RNA isolation, RNA library construction and sequencing

T. gondii tachyzoites from four different strains (ME49, VEG, RH, GT1) were maintained by serial passage in human foreskin fibroblast (HFF) monolayers as previously described (Roos et al. 1995), infecting confluent monolayers with ~10^7 tachyzoites. For time course experiments, parasites were propagated in Vero cells (Cercopithecus aethiops) for two passages immediately before the infection of HFFs for RNA isolation, to avoid inadvertent contamination with human material from the previous infectious cycle. After media removal, cell monolayers were scraped in 700 µl of Qiazol, and RNA was isolated from cell lysates using the Qiagen miRNeasy mini kit, according to the manufacturer's instructions. Six biological replicates were collected per strain and timepoint, and checked for RNA quality using a BioAnalyzer (Agilent). Strain M4 bradyzoites, strain CZ-H3 enterocytes (gametocytes), and strain M4 oocysts were prepared and RNA isolated as previously described (Buchholz et al. 2011; Fritz, Buchholz, et al. 2012; Juránková et al. 2013; Basso et al. 2013; Hehl et al. 2015). Biological replicates were pooled, and total and polyA+ selected RNA were used to construct strand-specific mRNA (and in some cases small non-coding RNA) libraries as previously described (Li, Zheng, Vandivier, et al. 2012; Elliott et al. 2013), and sequenced on an Illumina HiSeq 2000 (see Table 1 for total number of reads per library). All RNA-seq data described in this study are available from the Toxoplasma Genome Database, at ToxoDB.org (Gajria et al. 2008), along with other RNA-seq datasets and diverse additional information (Fritz, Bowyer, et al. 2012; Minot et al. 2012; Reid et al. 2012; Lorenzi et al. 2016).

Alignment of RNA-seq reads to the T. gondii genome: the ToxoDB pipeline for mapping RNA-seq reads

For transcript assembly, RNA-seq reads were initially mapped onto T. gondii ME49 genome release 28 using RUM (Grant et al. 2011); subsequent studies used GSNAP (Wu and Nacu 2010) to map to genome version 29. The RUM alignment pipeline takes advantage of the speed of Bowtie (Langmead et al. 2009) to map against both the genome and transcriptome; unmapped reads are then mapped against the genome using Blat (Kent 2002), and information from all three mappings is then merged. GSNAP was configured to look for both known and novel splicing. Coverage was determined for unique and non-unique alignments (separated by strand when possible). Strand orientation of splicing was determined based on the usage of GT/AG, GC/AG, or AT/AC dinucleotide pairs on the plus strand (or their complements on the minus strand). In cases where strand could not be defined, the program applies a probabilistic splice model to determine orientation (Wu and Nacu 2010). Performance of both RUM and GSNAP is comparable to that of the best RNA-seq alignment tools available at the time this study was completed (Engström et al. 2013).
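The dinucleotide-based strand call described above can be illustrated with a minimal Python sketch. This is not GSNAP's implementation: the function names (`reverse_complement`, `splice_strand`) are hypothetical, the input is assumed to be the intron sequence as read from the plus strand of the genome, and the probabilistic fallback for undefined cases is not modeled.

```python
# Donor/acceptor dinucleotide pairs named in the text, as read on the
# transcribed strand (intron start, intron end).
PLUS_PAIRS = {("GT", "AG"), ("GC", "AG"), ("AT", "AC")}

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


# For a minus-strand intron, the plus-strand genomic sequence begins with the
# reverse complement of the acceptor and ends with that of the donor.
MINUS_PAIRS = {
    (reverse_complement(acceptor), reverse_complement(donor))
    for donor, acceptor in PLUS_PAIRS
}


def splice_strand(intron_seq):
    """Return '+', '-', or None for a plus-strand genomic intron sequence."""
    seq = intron_seq.upper()
    pair = (seq[:2], seq[-2:])
    if pair in PLUS_PAIRS:
        return "+"
    if pair in MINUS_PAIRS:
        return "-"
    return None  # orientation undefined by dinucleotides alone


if __name__ == "__main__":
    print(splice_strand("GTAAGTTTTAG"))   # canonical GT...AG intron
    print(splice_strand("CTAAAACTTAC"))   # the same intron on the minus strand
```

In this sketch the three plus-strand pairs and their minus-strand counterparts do not collide, so a call of `None` simply marks introns that would need another source of strand information.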

Algorithm for gene model learning and prediction

Gene model training and predictions were performed using a version of CRAIG (A. Bernal et al. 2007) that integrates RNA-seq data, encoded as features derived from the mapping of reads to the T. gondii genome, as described above. The evidence integration strategy and feature encoding for RNA-seq data have been reported previously (Bernal, Crammer, and Pereira 2012; Bernal and Pereira 2012). ToxoDB v28 gene annotations were used for training, after filtering to exclude genes with evidence of significant alternative splicing (see below). The learned model integrates ab initio features such as segment length distributions, with features derived from junction-spanning and coverage reads. We sought completeness in transcript prediction by forcing CRAIG to define at least one transcript model for each non-overlapping transcript junction (putative intron) with read counts >3, and for overlapping junctions, those displaying >20% of the highest support observed in any overlapping junction within the same gene model.
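The junction selection rule in the last sentence can be sketched as follows. This is an illustration of the stated thresholds only, not CRAIG code: the tuple layout, the function names, and the choice to treat any two junctions with intersecting coordinates as "overlapping" are assumptions.

```python
def intervals_overlap(a, b):
    """True if two (start, end, reads) junctions share any position."""
    return a[0] <= b[1] and b[0] <= a[1]


def select_junctions(junctions, min_reads=3, min_fraction=0.2):
    """Pick junctions of one gene model that must receive a transcript model.

    junctions: list of (start, end, read_count) tuples, inclusive coordinates.
    A junction with no overlapping rival is kept if read_count > min_reads.
    A junction with rivals is kept if read_count exceeds min_fraction of the
    highest read_count in its overlap group (itself included).
    """
    kept = []
    for i, junction in enumerate(junctions):
        rivals = [
            other
            for k, other in enumerate(junctions)
            if k != i and intervals_overlap(junction, other)
        ]
        if not rivals:
            if junction[2] > min_reads:
                kept.append(junction)
        else:
            best = max(j[2] for j in rivals + [junction])
            if junction[2] > min_fraction * best:
                kept.append(junction)
    return kept


if __name__ == "__main__":
    # Hypothetical junctions for a single gene model.
    example = [
        (100, 200, 50),   # dominant junction
        (150, 250, 5),    # overlapping, 10% of the best support
        (150, 260, 15),   # overlapping, 30% of the best support
        (400, 500, 2),    # isolated, too few reads
        (600, 700, 4),    # isolated, passes the >3 threshold
    ]
    print(select_junctions(example))
```

Whether overlapping junctions must additionally pass the >3 read threshold is not stated in the text; the sketch applies only the 20% rule to them.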

Assessment of genome annotation, analysis of alternative splicing and visualization

To assess the quality of the reference T. gondii genome annotation, constructed largely based on ab initio methods informed by EST sequences from tachyzoite stage parasites only, expression data was retrieved for all predicted introns in 69 RNA-seq samples available in ToxoDB.org (Table 1), including samples from many strains, and most T. gondii life cycle stages (see Chapter 1 for a description of the parasite’s asexual and sexual life cycles). This yielded a list of 2,731,523 candidate introns, but many were observed in only one or two samples, or at very low abundance levels in any sample. Introns observed <6 times overall, or with <3 reads in any of the 69 samples considered, were excluded from further analysis, leaving a total of 147,715 for in-depth analysis (including 997 previously-annotated introns not satisfying the above criteria).
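The abundance filter can be sketched in a few lines of Python. The wording of the second criterion admits more than one reading; the sketch below assumes that an intron is retained only if it is seen at least 6 times overall and reaches at least 3 reads in at least one sample. The function and parameter names are hypothetical.

```python
def passes_abundance_filter(per_sample_counts, min_total=6, min_in_one_sample=3):
    """Apply one reading of the intron abundance filter.

    per_sample_counts: intron-spanning read counts for one intron, one value
    per RNA-seq sample. Assumed rule: keep the intron only if the summed count
    is >= min_total AND some single sample has >= min_in_one_sample reads.
    """
    total = sum(per_sample_counts)
    best_sample = max(per_sample_counts, default=0)
    return total >= min_total and best_sample >= min_in_one_sample


if __name__ == "__main__":
    # Hypothetical counts across a handful of samples.
    print(passes_abundance_filter([1, 1, 1, 1, 1, 1]))  # spread thinly
    print(passes_abundance_filter([0, 4, 2]))           # one well-supported sample
```

Previously-annotated introns were retained regardless of these thresholds, so a full implementation would need an exemption list in addition to this test.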

Table 1. List of T. gondii RNAseq datasets used in this study (ToxoDB release 28)

Sample | Row | Ref | Strain | Stage | Host | Cond | Time | RNA | Str Spec | Ins Size | Read Ln | Total Reads * | Tg Unique * | % † | Total ISRs ‡ | %
--- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | ---

Intracellular Tachyzoites -- Diverse Strains

GT1 Sibley | 1 | 1 | GT1 | Tachyzoite | HFF cells | In vitro | 72 hr? | polyA+ | No | unknown | 100+100? | 164,728,174 | 28,962,965 | 17.6% | 2,630,466 | 9.1%
ME49 Sibley | 2 | 1 | ME49 | " | " | " | " | " | " | unknown | 100? | 31,824,251 | 22,950,642 | 72.1% | 2,726,503 | 11.9%
ARI | 3 | 2 | ARI | Tachyzoite | HFF cells | In vitro | 72 hr? | polyA+ | No | 220 | 40+40 | 127,295,324 | 20,890,634 | 16.4% | 1,607,718 | 7.7%
B41 | 4 | 2 | B41 | " | " | " | " | " | " | " | " | 39,158,742 | 7,627,165 | 19.5% | 465,197 | 6.1%
B73 | 5 | 2 | B73 | " | " | " | " | " | " | " | " | 27,010,286 | 6,775,092 | 25.1% | 446,632 | 6.6%
BOF | 6 | 2 | BOF | " | " | " | " | " | " | " | " | 31,918,787 | 8,692,004 | 27.2% | 593,508 | 6.8%
CAST | 7 | 2 | CAST | " | " | " | " | " | " | " | " | 30,684,829 | 12,688,027 | 41.3% | 772,439 | 6.1%
CASTELLS | 8 | 2 | CASTELLS | " | " | " | " | " | " | " | " | 25,943,434 | 11,269,884 | 43.4% | 698,953 | 6.2%
CEPdelta | 9 | 2 | CEPdelta | " | " | " | " | " | " | " | " | 22,911,358 | 9,730,742 | 42.5% | 646,186 | 6.6%
COUGAR | 10 | 2 | COUGAR | " | " | " | " | " | " | " | " | 26,336,254 | 10,043,346 | 38.1% | 630,228 | 6.3%
DEG | 11 | 2 | DEG | " | " | " | " | " | " | " | " | 24,488,338 | 13,221,779 | 54.0% | 791,392 | 6.0%
FOU | 12 | 2 | FOU | " | " | " | " | " | " | " | " | 27,511,865 | 4,801,559 | 17.5% | 345,916 | 7.2%
GPHT | 13 | 2 | GPHT | " | " | " | " | " | " | " | " | 17,408,230 | 4,098,107 | 23.5% | 298,438 | 7.3%
GT1 | 14 | 2 | GT1 | " | " | " | " | " | " | " | " | 30,487,790 | 11,530,934 | 44.2% | 919,285 | 6.8%
GUYDOS | 15 | 2 | GUYDOS | " | " | " | " | " | " | " | " | 27,575,023 | 13,468,205 | 85.4% | 765,443 | 3.3%
GUYKOE | 16 | 2 | GUYKOE | " | " | " | " | " | " | " | " | 45,820,315 | 23,550,991 | 51.4% | 1,347,240 | 5.7%
GUYMAT | 17 | 2 | GUYMAT | " | " | " | " | " | " | " | " | 26,315,451 | 9,347,030 | 35.5% | 600,355 | 6.4%
MAS | 18 | 2 | MAS | " | " | " | " | " | " | " | " | 19,686,294 | 10,123,378 | 51.4% | 611,495 | 6.0%
ME49 | 19 | 2 | ME49 | " | " | " | " | " | " | " | " | 25,200,682 | 11,045,627 | 43.8% | 653,377 | 5.9%
P89 | 20 | 2 | P89 | " | " | " | " | " | " | " | " | 27,368,978 | 16,818,388 | 61.5% | 1,034,813 | 6.2%
PRUdelta | 21 | 2 | PRUdelta | " | " | " | " | " | " | " | " | 35,047,626 | 19,656,333 | 56.1% | 1,140,965 | 5.8%
RAY | 22 | 2 | RAY | " | " | " | " | " | " | " | " | 23,593,872 | 9,785,081 | 41.5% | 616,145 | 6.3%
Rhdelta | 23 | 2 | Rhdelta | " | " | " | " | " | " | " | " | 61,603,537 | 29,353,242 | 47.6% | 1,824,046 | 6.2%
ROD | 24 | 2 | ROD | " | " | " | " | " | " | " | " | 45,907,606 | 16,795,259 | 36.6% | 1,315,830 | 7.8%
RUB | 25 | 2 | RUB | " | " | " | " | " | " | " | " | 44,678,085 | 17,164,186 | 38.4% | 1,114,844 | 6.5%
TgCATBr44 | 26 | 2 | TgCATBr44 | " | " | " | " | " | " | " | " | 51,063,437 | 17,750,373 | 34.8% | 1,118,740 | 6.3%
TgCATBr5 | 27 | 2 | TgCATBr5 | " | " | " | " | " | " | " | " | 22,403,348 | 6,857,924 | 30.6% | 413,230 | 6.0%
TgCATBr9 | 28 | 2 | TgCATBr9 | " | " | " | " | " | " | " | " | 63,264,704 | 17,443,532 | 27.6% | 1,372,374 | 7.9%
VAND | 29 | 2 | VAND | " | " | " | " | " | " | " | " | 41,694,538 | 28,878,303 | 69.3% | 1,846,138 | 6.4%
VEG | 30 | 2 | VEG | " | " | " | " | " | " | " | " | 18,166,749 | 9,144,480 | 50.3% | 598,049 | 6.5%
WTD3 | 31 | 2 | WTD3 | " | " | " | " | " | " | " | " | 34,270,722 | 14,811,235 | 43.2% | 937,868 | 6.3%

Intracellular Tachyzoites -- Developmental Series

Reid D3 | 32 | 3 | VEG | Tachyzoite | HFF cells | In vitro | 72 hr | polyA+ | No | 200-250 | 76+76 | 65,238,810 | 51,911,446 | 79.6% | 4,096,024 | 7.9%
Reid D4 | 33 | 3 | " | " | " | " | 96 hr | " | " | " | " | 77,988,674 | 63,623,584 | 81.6% | 5,170,913 | 8.1%
RH 2 hr | 34 | 4 | RH | Tachyzoite | HFF cells | In vitro | 2 hr | polyA+ | Yes | 275-375 | 100 | 1,894,365 | 194,671 | 10.3% | 35,875 | 18.4%
RH 22 hr | 35 | 4 | " | " | " | " | 22 hr | " | " | " | " | 14,927,266 | 8,930,207 | 59.8% | 1,605,429 | 18.0%
RH 36 hr | 36 | 4 | " | " | " | " | 36 hr | " | " | " | " | 16,839,750 | 15,348,364 | 91.1% | 2,209,144 | 14.4%
GT1 2hr | 37 | 4 | GT1 | Tachyzoite | HFF cells | In vitro | 2 hr | " | " | 275-375 | " | 3,918,897 | 364,309 | 9.3% | 64,074 | 17.6%
GT1 4hr | 38 | 4 | " | " | " | " | 4 hr | " | " | " | " | 833,498 | 111,155 | 13.3% | 23,529 | 21.2%
GT1 8hr | 39 | 4 | " | " | " | " | 8 hr | " | " | " | " | 2,453,961 | 514,210 | 21.0% | 99,994 | 19.4%
GT1 16hr | 40 | 4 | " | " | " | " | 16 hr | " | " | 275-375 | " | 59,500,308 | 21,584,941 | 36.3% | 3,712,516 | 17.2%
ME49 2hr | 41 | 4 | ME49 | Tachyzoite | HFF cells | In vitro | 2 hr | " | " | 55-150 | " | 60,290 | 4,250 | 7.0% | 480 | 11.3%
ME49 4hr | 42 | 4 | ME49 | Tachyzoite | HFF cells | In vitro | 4 hr | " | " | " | " | 59,182,474 | 5,850,571 | 9.9% | 523,097 | 8.9%
ME49 8hr | 43 | 4 | " | " | " | " | 8 hr | " | " | " | " | 42,147,632 | 5,833,363 | 13.8% | 479,805 | 8.2%
ME49 16hr | 44 | 4 | " | " | " | " | 16 hr | " | " | " | " | 19,467,179 | 4,387,649 | 22.5% | 403,446 | 9.2%
ME49 36hr | 45 | 4 | " | " | " | " | 36 hr | " | " | 275-375 | " | 19,313,474 | 10,951,950 | 56.7% | 1,763,689 | 16.1%
ME49 44hr | 46 | 4 | " | " | " | " | 44 hr | " | " | " | " | 13,648,696 | 12,332,981 | 90.4% | 1,722,462 | 14.0%
VEG 2 hr | 47 | 4 | VEG | Tachyzoite | HFF cells | In vitro | 2 hr | " | " | 55-150 | " | 116,612,403 | 13,005,698 | 11.2% | 1,088,000 | 8.4%
VEG 4 hr | 48 | 4 | " | " | " | " | 4 hr | " | " | " | " | 58,188,707 | 7,680,123 | 13.2% | 661,484 | 8.6%
VEG 8 hr | 49 | 4 | " | " | " | " | 8 hr | " | " | " | " | 57,496,732 | 9,139,881 | 15.9% | 742,082 | 8.1%
VEG 16 hr | 50 | 4 | " | " | " | " | 16 hr | " | " | " | " | 26,968,837 | 6,543,629 | 24.3% | 665,172 | 10.2%
VEG 36 hr | 51 | 4 | " | " | " | " | 36 hr | " | " | 275-375 | " | 22,772,838 | 12,418,256 | 54.5% | 2,058,765 | 16.6%
VEG 44 hr | 52 | 4 | " | " | " | " | 44 hr | " | " | " | " | 14,768,871 | 8,826,301 | 59.8% | 1,369,143 | 15.5%

Bradyzoite development

Knoll acute mouse | 53 | 5 | ME49 | Tachyzoite | HFF cells | In vitro | 10 days | polyA+ | No | unknown | 50+50? | 619,510,492 | 868,532 | 0.1% | 51,040 | 5.9%
Knoll chronic mouse | 54 | 5 | " | Bradyzoite | Mouse | In vivo | 28 days | " | " | " | " | 662,065,868 | 1,441,169 | 0.2% | 100,103 | 6.9%
InVitro Bz | 55 | 9 | ME49 | Bradyzoite | HFF cells | In vitro | 7 days | polyA+ | Yes | 55-150 | 50 | 29,120,884 | 26,411,628 | 90.7% | 2,098,534 | 7.9%
InVivo Bz | 56 | 8 | M4 | Bradyzoite | Mouse | In vivo | 21 days | " | " | 275-375 | 100 | 113,123,601 | 27,297,107 | 24.1% | 4,452,108 | 16.3%

Gametocyte development

Hehl Tz | 57 | 9 | CZ-H3 | Tachyzoite | HFF cells | In vitro | control | polyA+ | Yes | unknown | 100+100? | 199,141,200 | 85,515,887 | 42.9% | 10,548,750 | 12.3%
Hehl D3 | 58 | 9 | " | Gametocyte | Cat intestinal epithelium | In vivo | 3 days | " | " | " | " | 552,571,698 | 12,947,597 | 2.3% | 1,719,245 | 13.3%
Hehl D5 | 59 | 9 | " | Gametocyte | " | " | 5 days | " | " | " | " | 195,225,948 | 81,478,584 | 41.7% | 11,566,841 | 14.2%
Hehl D7 | 60 | 9 | " | Gametocyte | " | " | 7 days | " | " | " | " | 751,346,454 | 128,435,940 | 17.1% | 18,013,378 | 14.0%

Oocyst development

Oocyst D0 | 61 | 6 | M4 | Oocyst | NA | unsporulated | control | polyA+ | Yes | 55-150 | 58 | 22,416,214 | 10,037,743 | 44.8% | 815,928 | 8.1%
Oocyst D4 | 62 | 6 | " | " | " | sporulated | 4 days | " | " | 55-150 | 50 | 20,628,790 | 18,970,427 | 92.0% | 1,508,317 | 8.0%
Oocyst D10 | 63 | 6 | " | " | " | " | 10 days | " | " | 55-150 | 50 | 20,243,335 | 18,524,507 | 91.5% | 1,427,680 | 7.7%

Other samples

SR3 uninduced | 64 | 7 | RH cSR3 | Tachyzoite | HFF cells | In vitro | control | polyA+ | No | unknown | 100+100? | 102,451,076 | 94,223,358 | 85.7% | 11,359,466 | 12.1%
SR3 4hr induced | 65 | 7 | " | " | " | SR3-induction | 4 hr | " | " | " | " | 89,992,092 | 82,711,487 | 91.9% | 9,793,519 | 11.8%
SR3 8hr induced | 66 | 7 | " | " | " | " | 8 hr | " | " | " | " | 91,194,210 | 83,917,767 | 92.0% | 10,126,194 | 12.1%
SR3 24hr induced | 67 | 7 | " | " | " | " | 24 hr | " | " | " | " | 96,979,952 | 87,825,880 | 90.6% | 10,539,187 | 12.0%
ncRNA RH | 68 | 9 | RH | Tachyzoite | HFF cells | In vitro | 33 hr | ncRNA | Yes | 15-45 | 38 | 37,861,116 | 2,061,028 | 5.4% | 127,575 | 6.2%
ncRNA ME49 | 69 | 9 | ME49 | " | " | " | 24 hr | " | " | " | " | 34,330,519 | 2,064,798 | 6.0% | 145,050 | 7.0%

Totals (Total Reads *; Tg Unique *; Total ISRs ‡)

Total (all samples): 5,573,795,740; 1,469,567,425; 153,771,851
Excluding 'Other': 5,120,986,775; 1,116,763,107; 111,680,860
High quality strand-specific libraries (green): 2,354,385,967; 506,589,822; 63,627,926
Unique intron junctions: 2,731,523

References

1. Lorenzi H et al., Nature Communications; 2016 (ref 7)
2. Minot S et al., Proc Natl Acad Sci; 2012 (ref 18)
3. Reid AJ et al., PLoS Pathogens; 2012 (ref 19)
4. This study
5. Pittman KJ et al., BMC Genomics; 2014 (ref 20)
6. Fritz HM et al., PLoS One; 2012 (ref 17)
7. Yeoh LM et al., Nucl Acids Res; 2015 (ref 21)
8. Buchholz KR et al., Eukaryot Cell; 2011 (ref 51)
9. Unpublished; available from ToxoDB.org

* read fragments for paired-end libraries (to avoid double-counting); denominator for FPKM calculations
† low % unique mapping reads is attributable to host cell RNA (cf. ME49 time course in rows 42-46)
‡ total # intron-spanning reads (ISRs); denominator for ISRPM calculations

Table 1. List of T. gondii RNAseq datasets used in this study (ToxoDB release 28). Green highlighting indicates 20 high quality samples used to define the prevalence of alternative splicing and mechanisms of transcriptional regulation; pink highlighting indicates reasons for exclusion of other samples from the analyses presented in Chapter 2; see text for further discussion.

Because strand-mapping information was not retained in the ToxoDB pipeline implementation of RUM, strandedness was assigned by analyzing five nucleotides up- and downstream of each intron to determine the most probable splice donor and acceptor. Analysis of the abundance distribution of dinucleotide pairs for each intron showed that (as expected) the most common splice signal (on the plus strand) was 5’-GT/AG-3’, which is >80 times more abundant than any of the other possible 63 intron combinations. 5’-GC/AG-3’ and 5’-GA/AG-3’ were the next most common (enriched 1.8 & 1.4 times, respectively). We therefore further filtered this intron list to include only introns that contained the major splice signal 5’-GT/AG-3’. This procedure yielded a total of 66,104 introns for examination in greater detail.
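A tally of this kind can be sketched with a `Counter`. The sketch below is illustrative only: the function names are hypothetical, the input is assumed to be a list of (donor, acceptor) dinucleotide pairs already oriented to the transcribed strand, and the toy counts in the usage example are invented to show the calculation, not taken from the study.

```python
from collections import Counter


def tally_splice_signals(pairs):
    """Count each (donor, acceptor) dinucleotide pair, case-insensitively."""
    return Counter((donor.upper(), acceptor.upper()) for donor, acceptor in pairs)


def fold_over_runner_up(counts):
    """Ratio of the most common pair's count to the second most common."""
    ranked = counts.most_common(2)
    if len(ranked) < 2:
        return None
    return ranked[0][1] / ranked[1][1]


def keep_major_signal(pairs, major=("GT", "AG")):
    """Retain only introns carrying the major splice signal."""
    return [p for p in pairs if (p[0].upper(), p[1].upper()) == major]


if __name__ == "__main__":
    # Invented toy data: 160 GT/AG, 2 GC/AG and 1 GA/AG introns.
    toy = [("GT", "AG")] * 160 + [("GC", "AG")] * 2 + [("GA", "AG")]
    counts = tally_splice_signals(toy)
    print(counts.most_common(3))
    print(fold_over_runner_up(counts))
    print(len(keep_major_signal(toy)))
```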

Once strand was recovered, introns were assigned to gene structures to determine those fully contained within a previously-annotated gene model, those lying fully within intergenic regions in the draft annotation (potential extensions of existing gene models, or associated with previously unrecognized genes), and those overlapping draft gene models. For introns associated with a specific gene, reads spanning that intron should be comparable to reads mapping to the mature transcript. The abundance of intron-spanning reads (ISRs) per million reads in the library (ISRPM) was therefore plotted as a function of the number of reads mapping to the assigned gene, normalized to gene size (FPKM = read fragments per kilobase of transcript, per million total mapped reads).
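The two normalizations can be written out as a short Python sketch. The denominators follow the Table 1 footnotes (total ISRs per library for ISRPM; uniquely mapped T. gondii read fragments for FPKM); the function and argument names are hypothetical, and the numbers in the usage example are placeholders rather than values from the study.

```python
def isrpm(intron_isrs, library_total_isrs):
    """Intron-spanning reads for one intron, per million ISRs in the library."""
    return intron_isrs * 1e6 / library_total_isrs


def fpkm(gene_fragments, transcript_length_bp, library_mapped_fragments):
    """Read fragments per kilobase of transcript, per million mapped fragments."""
    return gene_fragments * 1e9 / (transcript_length_bp * library_mapped_fragments)


if __name__ == "__main__":
    # Placeholder values: an intron with 5 ISRs in a library of 1,000,000 ISRs,
    # in a 2 kb gene with 100 fragments out of 10,000,000 mapped fragments.
    print(isrpm(5, 1_000_000))
    print(fpkm(100, 2_000, 10_000_000))
```

Plotting `isrpm` against `fpkm` for each intron and its assigned gene gives the comparison described above; introns far below the trend are candidates for rare alternative junctions.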

Preliminary analysis using all 69 RNA-seq samples in Table 1 revealed poor correlation between intron and gene expression for some introns, invariably attributable to poor quality samples in which relatively few reads could be mapped to the parasite (or host) genome. These samples were therefore excluded from analysis of splice junction usage, as were samples for which only non-strand-specific reads are available, and samples involving splicing machinery mutants (pink shading in Table 1). Note, however, that data from all these samples remain available in ToxoDB, and were reviewed after the completion of our analysis of high confidence introns.

All further analysis was conducted using a final set of 59,755 introns, from 20 samples representing all parasite life cycle stages, in multiple strains (green shading in Table 1). To analyze the prevalence of alternative splicing, the abundance of each intron was compared to its most abundant alternative(s), if any. Data was visualized using DataGraph 4.1 (Visual Data Tools; Figs 2.4-2.6 & 2.12-2.13).
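The comparison of each intron with its most abundant alternative can be sketched as below. The sketch assumes that "alternatives" are other introns on the same strand whose coordinates overlap the intron in question; that definition, the function name, and the coordinates in the usage example are assumptions. The read counts in the example are those reported in the Results for the TgHXGPRT exon-skip variant (introns II, III and II+III in TgCZ-H3 tachyzoites).

```python
def relative_to_best_alternative(introns):
    """Compare each intron with its best-supported overlapping alternative.

    introns: list of (start, end, isr_count) on one strand, inclusive coords.
    Returns {(start, end): ratio}, where ratio is the intron's count divided
    by the highest count among overlapping introns, or None if it has no
    alternative (or the alternatives have zero reads).
    """
    ratios = {}
    for i, (start, end, count) in enumerate(introns):
        alternatives = [
            other_count
            for k, (o_start, o_end, other_count) in enumerate(introns)
            if k != i and start <= o_end and o_start <= end
        ]
        best = max(alternatives, default=0)
        ratios[(start, end)] = count / best if best > 0 else None
    return ratios


if __name__ == "__main__":
    # Hypothetical coordinates; counts from the HXGPRT example in the text.
    hxgprt = [
        (100, 200, 149),  # intron II
        (300, 400, 187),  # intron III
        (100, 400, 180),  # introns II+III (exon III skipped)
    ]
    for coords, ratio in relative_to_best_alternative(hxgprt).items():
        print(coords, ratio)
```

Ratios near 1 indicate alternatives of similar abundance, as for the HXGPRT exon skip, while ratios far below 1 flag rare junctions.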

Results

Transcriptional insights from RNA-seq, applied to Toxoplasma gondii

Prior to the development of high throughput methods for mRNA sequencing, eukaryotic gene finding relied on ab initio methods (predicting gene structure based on primary sequence alone), and de novo strategies informed by cDNA sequences from expressed sequence tags (ESTs; Burge and Karlin 1998; Salamov and Solovyev 2000; Wei and Brent 2006). The much greater coverage provided by low cost RNA-seq methods greatly enhances gene model accuracy, however, enabling the identification of previously unrecognized transcripts, refinement of untranslated region (UTR) annotation, definition of stage- and strain-specific transcripts, recognition of alternative splice junctions, etc. (Trapnell et al. 2010). When available, additional genomic-scale datasets (chromatin marks, transcription factor binding sites, protein expression data, etc.) can also be exploited to further improve the accuracy of gene model prediction (Lamesch et al. 2012).

TgHXGPRT was the first alternatively-spliced gene identified in the protozoan parasite Toxoplasma gondii (Donald et al. 1996), based on the presence or absence of an 'exon skip' polymorphism encoding a 49 amino acid insertion including an acylation motif responsible for protein association with parasite membranes (Chaudhary et al. 2005). The reference model in the official GenBank annotation (and ToxoDB) includes this exon-skip polymorphism as exon III (Fig 2.1 track 1; blue indicates transcription from left-to-right, i.e. on the forward, or top strand). Numerous ESTs map to TgHXGPRT (track 2), confirming both splice variants: exon III is missing from eleven ESTs (HXGPRT-I), but included in six (HXGPRT-II). Western blotting indicates similar relative abundance of HXGPRT-I vs HXGPRT-II at the protein level (Chaudhary et al. 2005). This well-validated example of an alternatively-spliced gene was used as a positive control to define and assess parameters for alternative transcript identification genome-wide.

RNA-seq data provide vastly greater experimental support: average depth for the experiment presented in Fig 2.1 is >700 reads (track 3; ~2^9.5 on the log scale plot shown in track 5). The most common apparent transcript initiation site (in this steady-state analysis) occurs at ~6,795,950 (heavy magenta arrow), ~100 nt downstream of the annotated 5’ end. In keeping with common conventions from the pre-RNA-seq era, this gene was originally annotated based on the longest, rather than the most abundant cDNA clone. Log-scale representation reveals the range of 5’ ends identified by RNA-seq, although we cannot exclude the possibility of 5’ exonuclease activity or premature termination during reverse transcription. Most transcripts appear to terminate close to the annotated 3’ end (filled magenta circle), although alternative low abundance termination sites are also evident, most prominently ~500 nt downstream (open magenta circle).

[Figure 2.1 image, TgME49_chrVIII. Tracks: 1, Annotated Genes (UTRs in gray); 2, EST Alignments; 3, mRNAseq Coverage, TgCZ-H3 Tachyzoites (Hehl) (linear plot); 4, Splice Site Junctions (union of all experiments); 5, mRNAseq Coverage, TgCZ-H3 Tachyzoites (Hehl) (log plot). Labels I-V.]

Figure 2.1. Reading RNA-seq data.

The annotated T. gondii HXGPRT gene (TgME49_200320), aligned to EST sequences and RNA-seq data (both linear & log coverage plots), including intron-spanning reads (brackets). Note the presence of multiple 5’ ends (at steady-state), likely corresponding to multiple promoters (magenta arrows), of which only the longest and most prominent would permit excision of intron I (which lies within the 5’UTR). EST coverage in intron I was previously misinterpreted as intron read-through variants (Donald et al. 1996). Magenta brackets highlight the well-validated exon skip variant (star) responsible for membrane association of HXGPRT isoform II (Donald et al. 1996; Chaudhary et al. 2005). Multiple (low abundance) 3’ ends are also observed (circles).

Six well-defined exons are clearly identifiable in the RNA-seq coverage plots, corresponding precisely to the annotation of TgHXGPRT-II. Exon III (magenta star) is less abundant than the other five, however (seen most clearly in the linear representation; track 3), providing evidence of alternative splicing, consistent with ESTs, Northern blotting, protein immunoprecipitation, proteomics and protein structure data (ToxoDB.org; Chaudhary et al. 2005). RNA-seq reads that map across intron junctions (intron-spanning reads; ISRs) are indicated by horizontal brackets in track 4 (pooled information from numerous experiments). Magenta brackets highlight introns corresponding to the known HXGPRT exon-skip polymorphism: for the experiment shown (TgCZ-H3 tachyzoites), 149 reads could be unequivocally mapped to intron II, 187 reads span intron III, and 180 reads span introns II+III (excising exon III), defining the HXGPRT-I transcript. Pooling all available experimental data (from multiple samples) provides overwhelming support for this exon-skip polymorphism (8573, 11475, and 12785 reads, respectively; ToxoDB release 28).

Intron-spanning reads (ISRs) also identify numerous unannotated introns, but these are significantly less abundant. For example, while intron I is supported by 391 reads in the experiment shown (27K in all studies), an alternative splice donor 112 nt upstream is supported by only 7 ISRs (227 in all studies), and this alternative intron is also supported by EST data; two ISRs (61 in all studies) support yet another alternative donor 8 nt further upstream. Note that none of these alternatives affects the predicted protein sequence, however, as intron I lies within the 5’ untranslated region (UTR). Alternative ISRs mapping to the HXGPRT coding sequence also seem unlikely to be biologically meaningful, as the most common (87 reads in all studies, but none in the experiment shown) extends intron IV by 17 nt, which would introduce a translational frame shift and premature termination eliminating most of the phosphoribosyl transferase domain (Pfam00156).

In addition to defining the exon skip polymorphism distinguishing HXGPRT-I & II, the original cDNA clones were interpreted to suggest retention of intron I in some transcripts (Chaudhary et al. 2005). Read coverage within intron I is significantly higher than other introns (~10% exon depth vs <5% for other introns), but careful examination of the RNA-seq data reveals that coverage is non-uniform, displaying gradually increasing depth in the direction of transcription, suggesting alternative promoters (lighter magenta arrows). This interpretation is consistent with both EST evidence and review of the original cDNA clones. Transcript initiation within intron I would of course preclude intron excision.

In sum, using the highly-curated TgHXGPRT gene as a positive control for reanalysis based on RNA-seq data improves the definition of UTRs, confirms exon boundary annotation and a known exon-skip variant, permits identification of additional rare (and probably biologically meaningless) splice variants, and reveals that the previously-described intron retention is more likely attributable to alternative transcript initiation within intron I. Applying such analyses genome-wide offers the prospect of significantly improved gene model definition. For example, as shown in Fig 2.2, previous annotation (ToxoDB release 7.3, produced without the benefit of RNA-seq data) failed to define 5’ UTRs for 5777 of the 8323 protein-coding genes in the reference T. gondii annotation, and 3’ UTRs for 4859 (gray bars). Incorporating RNA-seq information now identifies 7219 5’ UTRs and 7296 3’ UTRs, with a modal 5’ UTR length of ~750 nt, and 3’ UTR length of ~500 nt, similar to HXGPRT, above (blue in Fig 2.2).

Fig 2.3 presents an expanded genomic region, extending upstream of HXGPRT (the right-most gene in this panel), including RNA-seq evidence from both tachyzoite and gametocyte (enteroepithelial) stage parasites (Hehl et al. 2015), along with additional genomic-scale data from chromatin immunoprecipitation studies (Gissot et al. 2007). These experiments support the reference annotation of TgME49_200310 (immediately to the left of HXGPRT), which is heavily transcribed in tachyzoites (tracks 5,6,8), gametocytes (track 10), and other life cycle stages (not shown). Low abundance unannotated ISRs never exceed 2% of annotated intron abundance for this gene. Chromatin activation marks (H3K4me3 & H3K9ac; track 1) are consistent with a 242 nt region mediating divergent transcription of TgME49_200310 & 200320 (HXGPRT).

Figure 2.2. Length distribution of annotated UTRs. Lengths of UTRs before (gray) and after (blue) incorporating RNA-seq data into gene finding algorithms.

[Figure 2.2 image. Panels: 5’ UTR annotation; 3’ UTR annotation. Axes: number of genes vs UTR length (0-2 Kb).]

[Figure 2.3 image. Track groups: Chromatin marks; Annotation; Intron-spanning reads; mRNAseq (log scale); Strand-specific mRNA-seq. Tracks 1-10; x-axis 6.778-6.800 Mb.]

Figure 2.3. Strand-specific mRNA-seq reveals a multitude of alternative splice variants and antisense RNAs, including long non-coding RNAs.

Genome browser view of a 22 kb region (TgME49_chrVIII: 6778-6800 kb) displaying publicly available chromatin marks (H3K4me3 & H3K9ac, track 1), annotated gene models (including HXGPRT, at right, track 3), non strand-specific mRNA-seq data (tracks 5,6), and selected strand-specific mRNA-seq data from this study (tracks 7-10). Color intensity in track 4 indicates 10-fold
