Generation of transcript counts from pasilla dataset with kallisto


Malgorzata Nowicka, Mark Robinson

May 20, 2021

This vignette describes version 1.20.0 of the PasillaTranscriptExpr package.

Contents

1 Description of pasilla dataset
2 Required software
3 Downloading the pasilla data
4 Downloading the reference genome
5 Transcript quantification with kallisto
APPENDIX
A Session information
B References

1 Description of pasilla dataset

The pasilla dataset was produced by Brooks et al. [1]. The aim of their study was to identify exons that are regulated by the pasilla protein, the Drosophila melanogaster ortholog of mammalian NOVA1 and NOVA2 (well-studied splicing factors). In their RNA-seq experiment, the libraries were prepared from 7 biologically independent samples: 4 control samples and 3 samples in which pasilla was knocked down. The libraries were sequenced on the Illumina Genome Analyzer II using single-end and paired-end sequencing and different read lengths. The RNA-seq data can be downloaded from the NCBI's Gene Expression Omnibus (GEO) under the accession number GSE18508.


2 Required software

This workflow can be run on a Unix-like operating system, e.g., Linux or macOS, with a bash shell. All commands, including those that could be run from a terminal window, are run from within R using the system() function. The downloaded and generated files are saved in the current working directory.

Brooks et al. deposited their data in the Sequence Read Archive (SRA). To convert SRA data into fastq format, you need to install the SRA toolkit, available at http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=software.

For the transcript quantification, we use kallisto version 0.42.1 [2], a very fast program that quantifies transcript abundances. kallisto is based on the idea of pseudoalignment, which rapidly determines the compatibility of reads with transcripts without the need for alignment. Thus, it works directly on fastq files. The quantification is reported in transcripts per million (TPM) and in expected counts. In this package, we make the expected counts available. kallisto can be downloaded from http://pachterlab.github.io/kallisto/.
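To make the relation between the two output units concrete, the following minimal sketch converts expected counts to TPM using the standard definition (counts divided by effective length, rescaled to sum to one million). The helper name counts_to_tpm and all values are made up for illustration; kallisto reports both quantities itself, so this step is not part of the workflow.

```r
# Hypothetical helper: convert expected counts to TPM.
# rate_i = est_counts_i / eff_length_i; TPM_i = rate_i / sum(rate) * 1e6
counts_to_tpm <- function(est_counts, eff_length) {
  stopifnot(length(est_counts) == length(eff_length), all(eff_length > 0))
  rate <- est_counts / eff_length
  rate / sum(rate) * 1e6
}

# Toy values (not taken from the pasilla data)
toy <- data.frame(
  target_id = c("tx1", "tx2", "tx3"),
  est_counts = c(100, 300, 0),
  eff_length = c(1000, 1500, 800),
  stringsAsFactors = FALSE
)
toy$tpm <- counts_to_tpm(toy$est_counts, toy$eff_length)
toy
```

Because TPM is a relative measure that always sums to one million per sample, the expected counts are the quantity that retains information about sequencing depth.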

3 Downloading the pasilla data

We use an automated process to download the SRA files that correspond to 4 control (Untreated) samples and 3 pasilla knock-down (CG8144_RNAi) samples. All the information about the pasilla assay can be found in the metadata file SraRunInfo.csv, which can be downloaded from http://www.ncbi.nlm.nih.gov/sra?term=SRP001537 under Send to: → File → RunInfo → Create File. The same file is also available within this package in the extdata directory.

library(PasillaTranscriptExpr)

data_dir <- system.file("extdata", package = "PasillaTranscriptExpr")

sri <- read.table(paste0(data_dir, "/SraRunInfo.csv"), stringsAsFactors = FALSE, sep = ",", header = TRUE)

# Keep only the pasilla knock-down and untreated libraries
keep <- grep("CG8144|Untreated-", sri$LibraryName)
sri <- sri[keep, ]

sra_files <- basename(sri$download_path)

# Download one SRA file per run into the working directory
for(i in seq_len(nrow(sri)))
  download.file(sri$download_path[i], sra_files[i])

To convert the SRA files to fastq format, we use the fastq-dump command from the SRA toolkit. Then, we compress the fastq files.

cmd <- paste0("fastq-dump -O ./ --split-3 ", sra_files)

for(i in 1:length(cmd)) system(cmd[i])

system("gzip *.fastq")
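Before moving on, it can be useful to check that the compressed fastq files are where the later steps expect them. The sketch below builds the expected file names for a run, assuming the naming used later in this vignette (one file for single-end runs, a _1 and _2 pair for paired-end runs). The helper name expected_fastq is hypothetical.

```r
# Hypothetical helper: fastq.gz file names expected for one run,
# following the naming used in the kallisto calls later in this vignette
expected_fastq <- function(run, layout) {
  if (layout == "PAIRED") {
    paste0(run, c("_1", "_2"), ".fastq.gz")
  } else {
    paste0(run, ".fastq.gz")
  }
}

expected_fastq("SRR031714", "PAIRED")
expected_fastq("SRR031708", "SINGLE")
```

Wrapping the result in file.exists() would then flag any run whose conversion or compression did not complete.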

4 Downloading the reference genome

To run kallisto, you need to download a FASTA formatted file of target sequences:

system("wget ftp://ftp.ensembl.org/pub/release-70/fasta/drosophila_melanogaster/cdna/Drosophila_melanogaster.BDGP5.70.cdna.all.fa.gz")
system("gunzip Drosophila_melanogaster.BDGP5.70.cdna.all.fa.gz")

The output produced by kallisto contains only the transcript IDs. To add the corre- sponding gene IDs, we need to download the gene model annotation in GTF format:

system("wget ftp://ftp.ensembl.org/pub/release-70/gtf/drosophila_melanogaster/Drosophila_melanogaster.BDGP5.70.gtf.gz")
system("gunzip Drosophila_melanogaster.BDGP5.70.gtf.gz")
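The gene and transcript IDs are stored in the attribute column of the GTF file. As a rough illustration of what the transcript-to-gene mapping looks like, the sketch below extracts both IDs from a few made-up attribute strings with base R. The helper name get_attr and the IDs are hypothetical; the workflow itself reads the GTF with rtracklayer instead.

```r
# Hypothetical helper: pull one attribute value out of GTF attribute strings
get_attr <- function(attributes, key) {
  pattern <- paste0(".*", key, " \"([^\"]+)\".*")
  ifelse(grepl(paste0(key, " \""), attributes, fixed = TRUE),
         sub(pattern, "\\1", attributes), NA_character_)
}

# Made-up attribute strings in the GTF style
attrs <- c(
  'gene_id "FBgn0000001"; transcript_id "FBtr0000001"; exon_number "1";',
  'gene_id "FBgn0000001"; transcript_id "FBtr0000002"; exon_number "1";',
  'gene_id "FBgn0000002"; transcript_id "FBtr0000003"; exon_number "2";'
)

# One row per transcript, with the gene it belongs to
gt_toy <- unique(data.frame(
  gene_id = get_attr(attrs, "gene_id"),
  transcript_id = get_attr(attrs, "transcript_id"),
  stringsAsFactors = FALSE
))
gt_toy
```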

5 Transcript quantification with kallisto

We create a metadata file in which each row holds the information needed for a single call of kallisto. The pasilla data consist of paired-end and single-end samples. When you run kallisto on single-end reads, you have to specify the -l option, which defines the average fragment length. It can be found in sri$avgLength. One sample (GSM461179) was sequenced using different read lengths. Therefore, for this sample, we do the transcript quantification for each read length separately and add the resulting transcript counts in a later step.
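The following toy example sketches why one sample can lead to several kallisto calls: taking the unique combinations of library, layout, sample and read length gives one row per call, so a sample with two read lengths appears twice. The column names follow SraRunInfo.csv, but all values are invented.

```r
# Toy run table; column names follow SraRunInfo.csv, values are invented
runs <- data.frame(
  Run = c("runA", "runB", "runC", "runD"),
  LibraryName = c("lib1", "lib1", "lib2", "lib2"),
  LibraryLayout = "SINGLE",
  SampleName = c("s1", "s1", "s2", "s2"),
  avgLength = c(45, 45, 45, 40),
  stringsAsFactors = FALSE
)

# One row per distinct combination, i.e. per kallisto call
calls <- unique(runs[, c("LibraryName", "LibraryLayout", "SampleName", "avgLength")])
calls

# Samples that need more than one call (more than one read length)
multi <- names(which(table(calls$SampleName) > 1))
multi
```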

sri$LibraryName <- gsub("S2_DRSC_", "", sri$LibraryName)

metadata <- unique(sri[, c("LibraryName", "LibraryLayout", "SampleName", "avgLength")])

# Collect the fastq files that belong to each kallisto call
for(i in seq_len(nrow(metadata))){
  indx <- sri$LibraryName == metadata$LibraryName[i]
  if(metadata$LibraryLayout[i] == "PAIRED"){
    metadata$fastq[i] <- paste0(sri$Run[indx], "_1.fastq.gz ", sri$Run[indx], "_2.fastq.gz", collapse = " ")
  }else{
    metadata$fastq[i] <- paste0(sri$Run[indx], ".fastq.gz", collapse = " ")
  }
}

metadata$condition <- ifelse(grepl("CG8144_RNAi", metadata$LibraryName), "KD", "CTL")

metadata$UniqueName <- paste0(1:nrow(metadata), "_", metadata$SampleName)

In the first step of the kallisto workflow, we build an index with kallisto index:

cDNA_fasta <- "Drosophila_melanogaster.BDGP5.70.cdna.all.fa"

index <- "Drosophila_melanogaster.BDGP5.70.cdna.all.idx"

cmd <- paste("kallisto index -i", index, cDNA_fasta, sep = " ")
cmd

## [1] "kallisto index -i Drosophila_melanogaster.BDGP5.70.cdna.all.idx Drosophila_melanogaster.BDGP5.70.cdna.all.fa"

system(cmd)

The quantification is done with the kallisto quant command:

out_dir <- metadata$UniqueName

cmd <- paste("kallisto quant -i", index, "-o", out_dir, "-b 0 -t 5",
  ifelse(metadata$LibraryLayout == "SINGLE", paste("--single -l", metadata$avgLength), ""),
  metadata$fastq)
cmd

## [1] "kallisto quant -i Drosophila_melanogaster.BDGP5.70.cdna.all.idx -o 1_GSM461176 -b 0 -t 5 --single -l 45 SRR031708.fastq.gz SRR031709.fastq.gz SRR031710.fastq.gz SRR031711.fastq.gz SRR031712.fastq.gz SRR031713.fastq.gz"

## [2] "kallisto quant -i Drosophila_melanogaster.BDGP5.70.cdna.all.idx -o 2_GSM461177 -b 0 -t 5 SRR031714_1.fastq.gz SRR031714_2.fastq.gz SRR031715_1.fastq.gz SRR031715_2.fastq.gz"

## [3] "kallisto quant -i Drosophila_melanogaster.BDGP5.70.cdna.all.idx -o 3_GSM461178 -b 0 -t 5 SRR031716_1.fastq.gz SRR031716_2.fastq.gz SRR031717_1.fastq.gz SRR031717_2.fastq.gz"

## [4] "kallisto quant -i Drosophila_melanogaster.BDGP5.70.cdna.all.idx -o 4_GSM461179 -b 0 -t 5 --single -l 45 SRR031718.fastq.gz SRR031719.fastq.gz SRR031720.fastq.gz SRR031721.fastq.gz SRR031722.fastq.gz SRR031723.fastq.gz"

## [5] "kallisto quant -i Drosophila_melanogaster.BDGP5.70.cdna.all.idx -o 5_GSM461179 -b 0 -t 5 --single -l 44 SRR031718.fastq.gz SRR031719.fastq.gz SRR031720.fastq.gz SRR031721.fastq.gz SRR031722.fastq.gz SRR031723.fastq.gz"

## [6] "kallisto quant -i Drosophila_melanogaster.BDGP5.70.cdna.all.idx -o 6_GSM461179 -b 0 -t 5 --single -l 40 SRR031718.fastq.gz SRR031719.fastq.gz SRR031720.fastq.gz SRR031721.fastq.gz SRR031722.fastq.gz SRR031723.fastq.gz"

## [7] "kallisto quant -i Drosophila_melanogaster.BDGP5.70.cdna.all.idx -o 7_GSM461180 -b 0 -t 5 SRR031724_1.fastq.gz SRR031724_2.fastq.gz SRR031725_1.fastq.gz SRR031725_2.fastq.gz"

## [8] "kallisto quant -i Drosophila_melanogaster.BDGP5.70.cdna.all.idx -o 8_GSM461181 -b 0 -t 5 SRR031726_1.fastq.gz SRR031726_2.fastq.gz SRR031727_1.fastq.gz SRR031727_2.fastq.gz"

## [9] "kallisto quant -i Drosophila_melanogaster.BDGP5.70.cdna.all.idx -o 9_GSM461182 -b 0 -t 5 --single -l 75 SRR031728.fastq.gz SRR031729.fastq.gz"

for(i in 1:length(cmd)) system(cmd[i])
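The loop above does not check whether each call succeeded. Since system() returns the exit status of the command, a small wrapper can stop at the first failure. The sketch below is one possible way to do this; the function name run_all and its runner argument are hypothetical and not part of the package.

```r
# Hypothetical wrapper: run each command and stop at the first failure.
# `runner` defaults to system(), which returns the command's exit status.
run_all <- function(cmds, runner = system) {
  for (cmd in cmds) {
    status <- runner(cmd)
    if (status != 0)
      stop("Command failed with status ", status, ": ", cmd)
  }
  invisible(TRUE)
}

# Dry run with a stand-in runner that only prints the command
run_all(c("echo first", "echo second"),
        runner = function(cmd) { message(cmd); 0L })
```

In the workflow, calling run_all(cmd) with the default runner would take the place of the plain for loop.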

We want to add the gene information and merge the expected transcript counts from different samples into one table.

library(rtracklayer)

gtf_dir <- "Drosophila_melanogaster.BDGP5.70.gtf"
gtf <- import(gtf_dir)

# Transcript-to-gene mapping, indexed by transcript ID
gt <- unique(mcols(gtf)[, c("gene_id", "transcript_id")])
rownames(gt) <- gt$transcript_id

samples <- unique(metadata$SampleName)

counts_list <- lapply(1:length(samples), function(i){
  indx <- which(metadata$SampleName == samples[i])
  if(length(indx) == 1){
    abundance <- read.table(file.path(metadata$UniqueName[indx], "abundance.txt"), header = TRUE, sep = "\t", as.is = TRUE)
  }else{
    # Sample quantified in several runs: read each run and sum the counts
    abundance <- lapply(indx, function(j){
      abundance_tmp <- read.table(file.path(metadata$UniqueName[j], "abundance.txt"), header = TRUE, sep = "\t", as.is = TRUE)
      abundance_tmp <- abundance_tmp[, c("target_id", "est_counts")]
      abundance_tmp
    })
    abundance <- Reduce(function(...) merge(..., by = "target_id", all = TRUE, sort = FALSE), abundance)
    est_counts <- rowSums(abundance[, -1])
    abundance <- data.frame(target_id = abundance$target_id, est_counts = est_counts, stringsAsFactors = FALSE)
  }
  counts <- data.frame(abundance$target_id, abundance$est_counts, stringsAsFactors = FALSE)
  colnames(counts) <- c("feature_id", samples[i])
  return(counts)
})

counts <- Reduce(function(...) merge(..., by = "feature_id", all = TRUE, sort = FALSE), counts_list)

### Add gene IDs
counts$gene_id <- gt[counts$feature_id, "gene_id"]
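The step that adds up counts for a sample quantified in several runs can be tried in isolation. The sketch below applies the same merge-then-sum pattern to two made-up tables that mimic the target_id and est_counts columns of the kallisto output; the transcript names and counts are invented.

```r
# Toy kallisto-like tables for one sample quantified in two separate runs
run1 <- data.frame(target_id = c("tx1", "tx2", "tx3"),
                   est_counts = c(10, 0, 5), stringsAsFactors = FALSE)
run2 <- data.frame(target_id = c("tx3", "tx1", "tx2"),
                   est_counts = c(1, 2, 3), stringsAsFactors = FALSE)

# Merge on the transcript ID so that rows are matched by name, not by position
merged <- Reduce(function(...) merge(..., by = "target_id", all = TRUE, sort = FALSE),
                 list(run1, run2))

# Add the per-run counts for each transcript
summed <- data.frame(target_id = merged$target_id,
                     est_counts = rowSums(merged[, -1]),
                     stringsAsFactors = FALSE)
summed
```

Merging by ID rather than binding columns directly protects against runs that list transcripts in a different order. Note that with all = TRUE a transcript missing from one table would yield NA in the sum; this should not occur when all runs use the same index.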

At the end, we keep only the unique samples in our metadata file.

metadata <- unique(metadata[, c("LibraryName", "LibraryLayout", "SampleName", "condition")])
metadata

##       LibraryName LibraryLayout SampleName condition
## 156   Untreated-1        SINGLE  GSM461176       CTL
## 162   Untreated-3        PAIRED  GSM461177       CTL
## 164   Untreated-4        PAIRED  GSM461178       CTL
## 166 CG8144_RNAi-1        SINGLE  GSM461179        KD
## 172 CG8144_RNAi-3        PAIRED  GSM461180        KD
## 174 CG8144_RNAi-4        PAIRED  GSM461181        KD
## 176   Untreated-6        SINGLE  GSM461182       CTL

write.table(metadata, "metadata.txt", quote = FALSE, sep = "\t", row.names = FALSE)

### Final counts with columns sorted as in metadata

counts <- counts[, c("feature_id", "gene_id", metadata$SampleName)]

write.table(counts, "counts.txt", quote = FALSE, sep = "\t", row.names = FALSE)


APPENDIX

A Session information

sessionInfo()

## R version 4.1.0 (2021-05-18)

## Platform: x86_64-pc-linux-gnu (64-bit)

## Running under: Ubuntu 20.04.2 LTS

##

## Matrix products: default

## BLAS: /home/biocbuild/bbs-3.13-bioc/R/lib/libRblas.so

## LAPACK: /home/biocbuild/bbs-3.13-bioc/R/lib/libRlapack.so

##

## locale:

## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C

## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C

## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8

## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C

## [9] LC_ADDRESS=C LC_TELEPHONE=C

## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

##

## attached base packages:

## [1] parallel stats4 stats graphics grDevices utils datasets

## [8] methods base

##

## other attached packages:

## [1] rtracklayer_1.52.0 GenomicRanges_1.44.0

## [3] GenomeInfoDb_1.28.0 IRanges_2.26.0

## [5] S4Vectors_0.30.0 BiocGenerics_0.38.0

## [7] PasillaTranscriptExpr_1.20.0 knitr_1.33

##

## loaded via a namespace (and not attached):

## [1] compiler_4.1.0 BiocManager_1.30.15

## [3] restfulr_0.0.13 highr_0.9

## [5] XVector_0.32.0 MatrixGenerics_1.4.0

## [7] bitops_1.0-7 tools_4.1.0

## [9] zlibbioc_1.38.0 digest_0.6.27

## [11] lattice_0.20-44 evaluate_0.14

## [13] rlang_0.4.11 Matrix_1.3-3

## [15] DelayedArray_0.18.0 yaml_2.2.1

## [17] xfun_0.23 GenomeInfoDbData_1.2.6

## [19] stringr_1.4.0 Biostrings_2.60.0

## [21] grid_4.1.0 Biobase_2.52.0

## [23] XML_3.99-0.6 BiocParallel_1.26.0


## [27] Rsamtools_2.8.0 codetools_0.2-18

## [29] htmltools_0.5.1.1 matrixStats_0.58.0

## [31] GenomicAlignments_1.28.0 SummarizedExperiment_1.22.0

## [33] BiocStyle_2.20.0 stringi_1.6.2

## [35] RCurl_1.98-1.3 crayon_1.4.1

## [37] rjson_0.2.20 BiocIO_1.2.0

B References

[1] A. N. Brooks, L. Yang, M. O. Duff, K. D. Hansen, J. W. Park, S. Dudoit, S. E. Brenner, and B. R. Graveley, “Conservation of an RNA regulatory map between Drosophila and mammals,” Genome Research, vol. 21, no. 2, pp. 193–202, 2011.

[2] N. L. Bray, H. Pimentel, P. Melsted, and L. Pachter, “Near-optimal RNA-Seq quantification.”
