
2.5 Alternative Splicing Event Detection and Quantification

2.5.7 Handling of Multiple Input Files

In all descriptions above, we only discussed how a single sample is used as input. SplAdder is capable of integrating the information of several input files. This is necessary in scenarios with several replicates per sample or if the splicing variation of a whole set of samples should be integrated. We distinguish four different modes to use multiple input files:

1. Single Graphs: This mode treats each input file independently and generates the same result as if running SplAdder on each file in a serial manner. That is, all steps are performed on each file, generating one result file per input file.

2. Merge Graphs: This mode generates an augmented splice graph for each input file but integrates all these graphs into a common graph representation. It further allows for filtering of the graph, to only retain nodes and edges that are confirmed in a certain fraction of input samples. Events are then detected on the common graph representation but quantified for each input file separately.

3. Merge Files: Here, all input files are treated as replicates and their information is merged. That is, only a single augmented splice graph is constructed and used for event calling. For the quantification of the splicing events, evidence from all input files is merged.

4. Merge All: This is a hybrid mode between merge graphs and merge files. It generates one splice graph per input file as well as a graph constructed from all files at once. All graphs are then integrated into a common representation that is used for event calling. The quantification of events is then again performed on the individual files.

These modes can also be combined to create hybrid strategies, e.g., to create a common splicing graph by merging all input files but quantify each file separately. This can be achieved by providing result files from intermediate steps as input to SplAdder and changing the options for the remaining steps.
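The difference between the merge graphs and merge files modes can be illustrated with a minimal Python sketch. It is not taken from the SplAdder code base: the function names, the encoding of an intron edge as a `(start, end)` tuple, and the `min_fraction` parameter are assumptions made for illustration, and only edges are filtered here, whereas SplAdder also filters nodes.

```python
from collections import Counter
from typing import Dict, Set, Tuple

# Hypothetical encoding: an intron edge is a (start, end) coordinate pair.
Edge = Tuple[int, int]
Counts = Dict[Edge, int]


def merge_graphs(sample_counts: Dict[str, Counts], min_fraction: float = 0.0) -> Set[Edge]:
    """Build a common edge set from per-sample graphs.

    An edge is retained only if it is present in at least `min_fraction`
    of the input samples (illustrative stand-in for the graph filter).
    """
    if not sample_counts:
        return set()
    n_samples = len(sample_counts)
    # Count in how many samples each edge occurs.
    support = Counter(edge for counts in sample_counts.values() for edge in counts)
    return {edge for edge, seen in support.items() if seen / n_samples >= min_fraction}


def quantify_per_sample(common: Set[Edge], sample_counts: Dict[str, Counts]) -> Dict[str, Counts]:
    """Merge graphs style: one quantification per input file on the common graph."""
    return {
        sample: {edge: counts.get(edge, 0) for edge in sorted(common)}
        for sample, counts in sample_counts.items()
    }


def quantify_merged(common: Set[Edge], sample_counts: Dict[str, Counts]) -> Counts:
    """Merge files style: evidence from all input files is pooled."""
    return {
        edge: sum(counts.get(edge, 0) for counts in sample_counts.values())
        for edge in sorted(common)
    }


if __name__ == "__main__":
    # Toy read counts per intron edge for three made-up samples.
    toy = {
        "a": {(100, 200): 5, (100, 300): 1},
        "b": {(100, 200): 7},
        "c": {(100, 200): 2, (150, 300): 4},
    }
    common = merge_graphs(toy, min_fraction=0.5)
    print(common)
    print(quantify_per_sample(common, toy))
    print(quantify_merged(common, toy))
```

In this toy setting, raising `min_fraction` removes edges supported by only a few samples before event calling, while the choice between the two quantification functions mirrors the choice between per-file and pooled quantification described above.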

2.5.8 Implementation and Software

SplAdder was initially implemented in Matlab/Octave code and packaged with shell scripts to provide a command line user interface, with no dependencies other than Matlab or Octave. A newer implementation is now also available in Python, removing the dependency on the Matlab/Octave computing environment. SplAdder relies on standard input formats for the gene annotation (GFF3 format) and the alignments (BAM format). Outputs are provided as plain text files or in HDF5 format. The user interface of the Matlab/Octave implementation, showing all available functionality, is given in Appendix A.5. The source code for both implementations is published under the GPL license and is publicly available at https://github.com/ratschlab/spladder.

In this chapter, we discuss four different projects in which the methods described in the previous chapter have been applied to data from various biological experiments. The first three projects deal with sequencing data from the model plant Arabidopsis thaliana, whereas the last project involves data from different human cancer samples. In all A. thaliana-related projects, we used PALMapper for the alignment and SplAdder for the annotation and quantification of alternative splicing events. The first section discusses the role of alternative splicing in the context of the post-transcriptional regulation mechanism of nonsense-mediated mRNA decay (NMD). Based on data from knockdown mutants, we assessed how many alternatively spliced transcripts are subject to this degradation mechanism [68]. In the second section, we describe our work on A. thaliana plants with mutations in polypyrimidine tract binding protein homologs (PTBs) that led to aberrations in splicing patterns and thus revealed functional roles of PTBs for the splicing of flowering regulators [246]. In the subsequent section, we describe the results of a large-scale analysis of two A. thaliana populations grown at different temperatures, with the aim of identifying expression and splicing quantitative trait loci (eQTL and sQTL, respectively) and investigating their effects within different environments. The last section discusses the analysis of whole-transcriptome sequencing samples of more than 4,000 cancer patients. We used SplAdder at large scale to identify and quantify alternative splicing events as phenotypes that were then associated with somatic as well as germline genetic alterations in these patients.

Author Contributions. All studies described in this chapter have been conducted in collaboration, either in small groups or within larger multi-institutional consortia. Here, we describe which parts were genuinely contributed by the author of this work (AK) and how the remaining work was split. The two studies on NMD and PTB were conducted in collaboration with Andreas Wachter (AW), Gabriele Drechsel (GD), Christina Rühl (CR), Eva Stauffer (ES), Anil K. Kesarwani (AKK), Jonas Behr (JB), Philipp Drewe (PD), Gabriele Wagner (GW) and Gunnar Rätsch (GR). For the research on NMD, AW, GD, AK and GR designed the project; GD, AKK, ES and AW carried out biological experiments and provided the data; AK and GR conceived the computational analysis strategy; AK designed and implemented the computational analysis pipeline, carried out all RNA-Seq alignments, performed alternative event quantification and differential analysis, implemented and performed the NMD feature analysis and carried out functional analyses on the candidate events; PD contributed to the adaptation of rDiff; and JB provided predictions of non-coding and intergenic transcripts. For the research on PTB, GR, CR, ES and AW designed the experimental setup; CR, ES, GW, GD and AW performed the biological experiments and provided the data; GR and AK conceived the computational analysis strategy; AK designed and implemented the analysis pipeline, performed all RNA-Seq alignments and data quality controls, characterized alternative splicing events and performed the differential analysis. Both studies resulted in peer-reviewed publications [68, 246].

The presented work on detecting sQTL in two populations of A. thaliana is part of an international collaboration with multiple other groups. We will only mention people relevant for the work presented here: Magnus Nordborg (MN), Pei Zhang (PZ), Richard M. Clark (RMC), Robert Greenhalgh (RG), Edward J. Osborne (EJO), Bjarni Vilhjalmsson (BV), Oliver Stegle (OS), Philipp Drewe (PD), Yi Zhong (YZ) and Gunnar Rätsch (GR). The data for the CEGS population was provided by MN and PZ, whereas the data for the MAGIC population was provided by RMC, EJO and RG. In both cases this included collection of biological material, preparation and sequencing. MN, RMC, GR and OS conceived the idea of the study. AK and GR designed the alignment pipeline. AK implemented and processed the RNA-Seq alignment, implemented and processed the alternative event detection and quantification, performed read counting and filtering for the expression analysis and carried out the sQTL analyses. YZ helped with a parameter study for the alignment and PD suggested filtering criteria for the expression counting. OS and BV provided code that was used for the linear mixed model analysis. EJO implemented and carried out the eQTL analyses.

The analysis of 12 different cancer types to detect eQTL and sQTL and determine splicing aberrations in cancer was a collaborative effort together with Kjong-Van Lehmann (KL), Gunnar Rätsch (GR), Cyriac Kandoth (CK), William Lee (WL), Nikolaus Schultz (NS), Oliver Stegle (OS) and The Cancer Genome Atlas research network (TCGA). All raw sequencing data was provided by TCGA. KL, GR, OS and AK conceived the study. GR and AK performed alignments and carried out alignment quality control. KL, CK, WL and AK performed variant calling. KL and AK designed and implemented the full data processing and association pipeline. AK implemented the detection and quantification of alternative splicing events, carried out the necessary quality filtering, implemented the pipeline to generate gene expression counts used for the eQTL analysis and performed the analysis of alternative splicing diversity over cancer types. OS provided efficient low-level code for the mixed model analysis used in the sQTL and eQTL analyses. NS provided a comprehensive list of cancer relevant genes.

3.1 Evaluation of Nonsense-mediated mRNA-Decay in Arabidopsis thaliana

Regulation of transcription is a complex process that involves not only various protein and RNA factors during mRNA synthesis but also processes that degrade transcriptional products. The most important mRNA degradation mechanism is nonsense-mediated mRNA decay (NMD, cf. Section 1.1), which not only helps to degrade products from pseudogenes [202], transposons [199] or certain non-coding RNAs [149] but also plays an important role in the degradation of physiological transcripts, giving it major regulatory potential. Numerous factors involved in NMD have been identified over the past years. A small set of proteins was found to be conserved over almost all eukaryotic species: the UP FRAMESHIFT proteins UPF1, UPF2 and UPF3. To investigate the role of NMD in transcriptional regulation and to further understand how it can be triggered by alternative splicing, we created A. thaliana plants lacking essential NMD factors and studied the effect on the transcriptome. Although single NMD factors have been knocked out in other organisms [311], no study existed in which several NMD factors had been knocked out, nor had a knockout study with whole-transcriptome assessment been conducted in plants. In the following, we give a detailed description of our study design, the setup of our computational pipeline and the results of our analysis. However, we will focus on the application of the computational methods described earlier, as these are the contributions of the author. For a more detailed introduction to the NMD mechanism and its biological relevance, and for further details on the biological methodology, we refer to our own work [68] as well as to three excellent reviews [45, 51, 186].