Generation of transcript counts from pasilla dataset with kallisto

Malgorzata Nowicka, Mark Robinson

May 20, 2021

This vignette describes version 1.20.0 of the PasillaTranscriptExpr package.

Contents

1 Description of pasilla dataset

2 Required software

3 Downloading the pasilla data

4 Downloading the reference genome

5 Transcript quantification with kallisto

APPENDIX

A Session information

B References

1 Description of pasilla dataset

The pasilla dataset was produced by Brooks et al. [1]. The aim of their study was to identify exons that are regulated by the pasilla protein, the Drosophila melanogaster ortholog of mammalian NOVA1 and NOVA2 (well-studied splicing factors). In their RNA-seq experiment, the libraries were prepared from 7 biologically independent samples: 4 control samples and 3 samples in which pasilla was knocked down. The libraries were sequenced on the Illumina Genome Analyzer II using single-end and paired-end sequencing and different read lengths. The RNA-seq data can be downloaded from NCBI's Gene Expression Omnibus (GEO) under the accession number GSE18508.

2 Required software

This workflow can be run on a Unix-like operating system, e.g., Linux or Mac OS X, with a bash shell. All commands, including those that could be run from a terminal window, are run from within R using the system() function. The downloaded and generated files are saved in the current working directory.

Brooks et al. deposited their data in the Short Read Archive (SRA). To convert SRA data into fastq format, you need to install the SRA toolkit, available at http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=software.

For the transcript quantification, we use kallisto version 0.42.1 [2], an extremely fast program that quantifies transcript abundances. kallisto is based on the idea of pseudoalignment, which rapidly determines the compatibility of reads with transcripts without the need for alignment. It therefore works directly on fastq files. The quantification is reported in transcripts per million (TPM) and in expected counts. In this package, we make the expected counts available. kallisto can be downloaded from http://pachterlab.github.io/kallisto/.
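The relation between the two reported quantities can be illustrated with a small sketch. It assumes the standard TPM definition (length-normalized counts rescaled to sum to one million) and a toy table whose values are made up; the column name eff_length is an assumption about the kallisto output, while target_id and est_counts are the columns used later in this vignette.

```r
# Toy abundance table; all values are invented for illustration.
abundance <- data.frame(
  target_id  = c("tx1", "tx2", "tx3"),
  eff_length = c(1000, 2000, 500),   # assumed column name
  est_counts = c(100, 400, 50),
  stringsAsFactors = FALSE
)

# Length-normalized rate per transcript, rescaled to sum to one million.
rate <- abundance$est_counts / abundance$eff_length
abundance$tpm <- rate / sum(rate) * 1e6

abundance
```

Because TPM values are relative within a sample, the expected counts are the quantity that keeps the information about sequencing depth.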

3 Downloading the pasilla data

We use an automated process to download the SRA files that correspond to the 4 control (Untreated) samples and the 3 pasilla knockdown (CG8144_RNAi) samples. All the information about the pasilla assay can be found in the metadata file SraRunInfo.csv, which can be downloaded from http://www.ncbi.nlm.nih.gov/sra?term=SRP001537 under Send to: → File → RunInfo → Create File. The same file is also available within this package in the extdata directory.

library(PasillaTranscriptExpr)

data_dir <- system.file("extdata", package = "PasillaTranscriptExpr")

# Read the SRA run metadata shipped with the package
sri <- read.table(paste0(data_dir, "/SraRunInfo.csv"), stringsAsFactors = FALSE, sep = ",", header = TRUE)

# Keep only the pasilla knockdown and untreated runs
keep <- grep("CG8144|Untreated-", sri$LibraryName)
sri <- sri[keep, ]

sra_files <- basename(sri$download_path)

# Download one SRA file per run into the working directory
for(i in seq_len(nrow(sri)))
  download.file(sri$download_path[i], sra_files[i])
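To see what the grep() pattern selects, here is a minimal sketch on a toy vector. The library names are hypothetical; they only imitate the S2_DRSC_ prefix that is stripped later in this vignette, and the third entry is an invented name that should be excluded.

```r
# Hypothetical library names, not taken from SraRunInfo.csv.
library_names <- c("S2_DRSC_Untreated-1", "S2_DRSC_CG8144_RNAi-1",
                   "S2_DRSC_Untreated_other", "S2_DRSC_CG8144_RNAi-3")

# The pattern keeps names containing "CG8144" or "Untreated-" (with the hyphen).
keep <- grep("CG8144|Untreated-", library_names)

library_names[keep]
```

The trailing hyphen in "Untreated-" is what makes the pattern skip names where "Untreated" is followed by anything else.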

To convert the SRA files to fastq format, we use the fastq-dump command from the SRA toolkit. Then, we compress the fastq files.

cmd <- paste0("fastq-dump -O ./ --split-3 ", sra_files)

for(i in 1:length(cmd)) system(cmd[i])

system("gzip *.fastq")
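Since every step relies on external programs called through system(), it may help to check that they are on the PATH before starting. The following is a minimal sketch; the helper name tools_available is hypothetical, and the list of tools is simply the set of commands used in this vignette.

```r
# Hypothetical helper: returns a named logical vector that is TRUE
# where the tool is found on the PATH.
tools_available <- function(tools) {
  found <- nzchar(Sys.which(tools))
  names(found) <- tools
  found
}

status <- tools_available(c("fastq-dump", "gzip", "wget", "kallisto"))

if (!all(status)) {
  message("Not found on PATH: ",
          paste(names(status)[!status], collapse = ", "))
}
```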

4 Downloading the reference genome

To run kallisto, you need to download a FASTA formatted file of target sequences:

system("wget ftp://ftp.ensembl.org/pub/release-70/fasta/drosophila_melanogaster/cdna/Drosophila_melanogaster.BDGP5.70.cdna.all.fa.gz")
system("gunzip Drosophila_melanogaster.BDGP5.70.cdna.all.fa.gz")

The output produced by kallisto contains only the transcript IDs. To add the corresponding gene IDs, we need to download the gene model annotation in GTF format:

system("wget ftp://ftp.ensembl.org/pub/release-70/gtf/drosophila_melanogaster/Drosophila_melanogaster.BDGP5.70.gtf.gz")
system("gunzip Drosophila_melanogaster.BDGP5.70.gtf.gz")
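The link between transcript and gene IDs lives in the attribute field (column 9) of the GTF file. This vignette reads it with rtracklayer; the sketch below only illustrates the idea in base R on two made-up attribute strings. The IDs and the helper get_attr are hypothetical.

```r
# Two invented GTF attribute fields (column 9 of a GTF file).
attrs <- c(
  'gene_id "FBgn0000001"; transcript_id "FBtr0000001"; exon_number "1";',
  'gene_id "FBgn0000001"; transcript_id "FBtr0000002"; exon_number "1";'
)

# Hypothetical helper: extract the quoted value that follows a given key.
get_attr <- function(x, key) {
  sub(paste0('.*', key, ' "([^"]+)".*'), "\\1", x)
}

# One row per transcript, indexed by transcript ID.
gt <- unique(data.frame(
  gene_id       = get_attr(attrs, "gene_id"),
  transcript_id = get_attr(attrs, "transcript_id"),
  stringsAsFactors = FALSE
))
rownames(gt) <- gt$transcript_id

gt["FBtr0000002", "gene_id"]
```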

5 Transcript quantification with kallisto

We create a metadata table in which each row holds the information needed for a single call to kallisto. The pasilla data consist of paired-end and single-end samples. When you run kallisto on single-end reads, you have to specify the -l option, which defines the average fragment length. It can be found in sri$avgLength. One sample (GSM461179) was sequenced using different read lengths. Therefore, for this sample, we run the transcript quantification separately for each read length and sum the resulting transcript counts in a later step.

sri$LibraryName <- gsub("S2_DRSC_", "", sri$LibraryName)

# One row per library and read length
metadata <- unique(sri[, c("LibraryName", "LibraryLayout", "SampleName", "avgLength")])

# Collect the fastq files that belong to each row
for(i in seq_len(nrow(metadata))){
  indx <- sri$LibraryName == metadata$LibraryName[i]
  if(metadata$LibraryLayout[i] == "PAIRED"){
    metadata$fastq[i] <- paste0(sri$Run[indx], "_1.fastq.gz ", sri$Run[indx], "_2.fastq.gz", collapse = " ")
  }else{
    metadata$fastq[i] <- paste0(sri$Run[indx], ".fastq.gz", collapse = " ")
  }
}

metadata$condition <- ifelse(grepl("CG8144_RNAi", metadata$LibraryName), "KD", "CTL")

metadata$UniqueName <- paste0(1:nrow(metadata), "_", metadata$SampleName)
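The way the fastq file names are assembled for the two layouts can be checked in isolation. The sketch below wraps the same paste0() calls in a hypothetical helper, fastq_args, and applies it to run accessions that appear in the kallisto commands of this vignette.

```r
# Hypothetical helper: build the fastq argument string for one library.
fastq_args <- function(runs, layout) {
  if (layout == "PAIRED") {
    # Each run contributes a pair of files, mate 1 followed by mate 2.
    paste0(runs, "_1.fastq.gz ", runs, "_2.fastq.gz", collapse = " ")
  } else {
    paste0(runs, ".fastq.gz", collapse = " ")
  }
}

paired <- fastq_args(c("SRR031714", "SRR031715"), "PAIRED")
single <- fastq_args(c("SRR031728", "SRR031729"), "SINGLE")

paired
single
```

For paired-end libraries the mates of each run stay adjacent in the resulting string, which is the order in which kallisto quant expects paired files.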

In the first step of the kallisto workflow, we build an index with kallisto index:

cDNA_fasta <- "Drosophila_melanogaster.BDGP5.70.cdna.all.fa"

index <- "Drosophila_melanogaster.BDGP5.70.cdna.all.idx"

cmd <- paste("kallisto index -i", index, cDNA_fasta, sep = " ")
cmd

## [1] "kallisto index -i Drosophila_melanogaster.BDGP5.70.cdna.all.idx Drosophila_melanogaster.BDGP5.70.cdna.all.fa"

system(cmd)

The quantification is done with the kallisto quant command:

out_dir <- metadata$UniqueName

cmd <- paste("kallisto quant -i", index, "-o", out_dir, "-b 0 -t 5",
  ifelse(metadata$LibraryLayout == "SINGLE", paste("--single -l", metadata$avgLength), ""),
  metadata$fastq)
cmd

## [1] "kallisto quant -i Drosophila_melanogaster.BDGP5.70.cdna.all.idx -o 1_GSM461176 -b 0 -t 5 --single -l 45 SRR031708.fastq.gz SRR031709.fastq.gz SRR031710.fastq.gz SRR031711.fastq.gz SRR031712.fastq.gz SRR031713.fastq.gz"

## [2] "kallisto quant -i Drosophila_melanogaster.BDGP5.70.cdna.all.idx -o 2_GSM461177 -b 0 -t 5 SRR031714_1.fastq.gz SRR031714_2.fastq.gz SRR031715_1.fastq.gz SRR031715_2.fastq.gz"

## [9] "kallisto quant -i Drosophila_melanogaster.BDGP5.70.cdna.all.idx -o 9_GSM461182 -b 0 -t 5 --single -l 75 SRR031728.fastq.gz SRR031729.fastq.gz"

for(i in 1:length(cmd)) system(cmd[i])
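The loop above does not look at the exit status of each call, so a failed kallisto run would go unnoticed until the output files are read. A minimal sketch of a stricter wrapper follows; the name run_checked is hypothetical, and the Unix utilities true and false stand in for real commands.

```r
# Hypothetical wrapper: run a shell command and stop on a non-zero exit status.
run_checked <- function(cmd) {
  status <- system(cmd)
  if (status != 0) {
    stop("Command failed with exit status ", status, ": ", cmd)
  }
  invisible(status)
}

# 'true' always succeeds and 'false' always fails on Unix-like systems.
ok <- run_checked("true")
failure <- tryCatch(run_checked("false"),
                    error = function(e) conditionMessage(e))
failure
```

In the workflow, the loop body could then call run_checked(cmd[i]) instead of system(cmd[i]).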

We want to add the gene information and merge the expected transcript counts from different samples into one table.

library(rtracklayer)

gtf_dir <- "Drosophila_melanogaster.BDGP5.70.gtf"

gtf <- import(gtf_dir)

# Transcript-to-gene lookup table, indexed by transcript ID
gt <- unique(mcols(gtf)[, c("gene_id", "transcript_id")])
rownames(gt) <- gt$transcript_id

samples <- unique(metadata$SampleName)

counts_list <- lapply(1:length(samples), function(i){
  indx <- which(metadata$SampleName == samples[i])
  if(length(indx) == 1){
    abundance <- read.table(file.path(metadata$UniqueName[indx], "abundance.txt"), header = TRUE, sep = "\t", as.is = TRUE)
  }else{
    # Sample quantified in several runs: read each run and sum the counts
    abundance <- lapply(indx, function(j){
      abundance_tmp <- read.table(file.path(metadata$UniqueName[j], "abundance.txt"), header = TRUE, sep = "\t", as.is = TRUE)
      abundance_tmp <- abundance_tmp[, c("target_id", "est_counts")]
      abundance_tmp
    })
    abundance <- Reduce(function(...) merge(..., by = "target_id", all = TRUE, sort = FALSE), abundance)
    est_counts <- rowSums(abundance[, -1])
    abundance <- data.frame(target_id = abundance$target_id, est_counts = est_counts, stringsAsFactors = FALSE)
  }
  counts <- data.frame(abundance$target_id, abundance$est_counts, stringsAsFactors = FALSE)
  colnames(counts) <- c("feature_id", samples[i])
  return(counts)
})

# Merge the per-sample tables into one table with one column per sample
counts <- Reduce(function(...) merge(..., by = "feature_id", all = TRUE, sort = FALSE), counts_list)

### Add gene IDs
counts$gene_id <- gt[counts$feature_id, "gene_id"]
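The two key operations of this step, summing the counts of a sample quantified in several runs and looking up gene IDs by row name, can be tried on toy data. All identifiers and counts below are invented for illustration.

```r
# Two invented kallisto runs of the same sample.
run_a <- data.frame(target_id = c("tx1", "tx2"), est_counts = c(10, 5),
                    stringsAsFactors = FALSE)
run_b <- data.frame(target_id = c("tx1", "tx2"), est_counts = c(3, 0),
                    stringsAsFactors = FALSE)

# Merge by transcript; the count columns become est_counts.x and est_counts.y.
merged <- merge(run_a, run_b, by = "target_id", all = TRUE, sort = FALSE)

# Sum the counts across runs.
summed <- data.frame(feature_id = merged$target_id,
                     sample1 = rowSums(merged[, -1]),
                     stringsAsFactors = FALSE)

# Invented transcript-to-gene table, indexed by transcript ID.
gt <- data.frame(gene_id = c("geneA", "geneA"),
                 transcript_id = c("tx1", "tx2"),
                 stringsAsFactors = FALSE)
rownames(gt) <- gt$transcript_id

summed$gene_id <- gt[summed$feature_id, "gene_id"]
summed
```

If a transcript were missing from one of the runs, all = TRUE would introduce an NA for it, and rowSums() would return NA for that transcript unless na.rm = TRUE is set.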

At the end, we keep only the unique samples in our metadata file.

metadata <- unique(metadata[, c("LibraryName", "LibraryLayout", "SampleName", "condition")])
metadata

##       LibraryName LibraryLayout SampleName condition
## 156   Untreated-1        SINGLE  GSM461176       CTL
## 162   Untreated-3        PAIRED  GSM461177       CTL
## 164   Untreated-4        PAIRED  GSM461178       CTL
## 166 CG8144_RNAi-1        SINGLE  GSM461179        KD
## 172 CG8144_RNAi-3        PAIRED  GSM461180        KD
## 174 CG8144_RNAi-4        PAIRED  GSM461181        KD
## 176   Untreated-6        SINGLE  GSM461182       CTL

write.table(metadata, "metadata.txt", quote = FALSE, sep = "\t", row.names = FALSE)

### Final counts with columns sorted as in metadata

counts <- counts[, c("feature_id", "gene_id", metadata$SampleName)]

write.table(counts, "counts.txt", quote = FALSE, sep = "\t", row.names = FALSE)

APPENDIX

A Session information

sessionInfo()

## R version 4.1.0 (2021-05-18)

## Platform: x86_64-pc-linux-gnu (64-bit)

## Running under: Ubuntu 20.04.2 LTS

##

## Matrix products: default

## BLAS: /home/biocbuild/bbs-3.13-bioc/R/lib/libRblas.so

## LAPACK: /home/biocbuild/bbs-3.13-bioc/R/lib/libRlapack.so

##

## locale:

## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C

## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C

## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8

## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C

## [9] LC_ADDRESS=C LC_TELEPHONE=C

## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

##

## attached base packages:

## [1] parallel stats4 stats graphics grDevices utils datasets

## [8] methods base

##

## other attached packages:

## [1] rtracklayer_1.52.0 GenomicRanges_1.44.0

## [3] GenomeInfoDb_1.28.0 IRanges_2.26.0

## [5] S4Vectors_0.30.0 BiocGenerics_0.38.0

## [7] PasillaTranscriptExpr_1.20.0 knitr_1.33

##

## loaded via a namespace (and not attached):

## [1] compiler_4.1.0 BiocManager_1.30.15

## [3] restfulr_0.0.13 highr_0.9

## [5] XVector_0.32.0 MatrixGenerics_1.4.0

## [7] bitops_1.0-7 tools_4.1.0

## [9] zlibbioc_1.38.0 digest_0.6.27

## [11] lattice_0.20-44 evaluate_0.14

## [13] rlang_0.4.11 Matrix_1.3-3

## [15] DelayedArray_0.18.0 yaml_2.2.1

## [17] xfun_0.23 GenomeInfoDbData_1.2.6

## [19] stringr_1.4.0 Biostrings_2.60.0

## [21] grid_4.1.0 Biobase_2.52.0

## [23] XML_3.99-0.6 BiocParallel_1.26.0

## [27] Rsamtools_2.8.0 codetools_0.2-18

## [29] htmltools_0.5.1.1 matrixStats_0.58.0

## [31] GenomicAlignments_1.28.0 SummarizedExperiment_1.22.0

## [33] BiocStyle_2.20.0 stringi_1.6.2

## [35] RCurl_1.98-1.3 crayon_1.4.1

## [37] rjson_0.2.20 BiocIO_1.2.0

B References

[1] A. N. Brooks, L. Yang, M. O. Duff, K. D. Hansen, J. W. Park, S. Dudoit, S. E. Brenner, and B. R. Graveley, “Conservation of an RNA regulatory map between Drosophila and mammals,” Genome Research, vol. 21, no. 2, pp. 193–202, 2011.


[2] N. L. Bray, H. Pimentel, P. Melsted, and L. Pachter, “Near-optimal RNA-Seq quantification.”