4.3.6   Discovering tissue-specific novel genes using Cufflinks

Cufflinks (version 1.0.3) was run separately for each sample. Transcripts were assembled from the TopHat alignment files (accepted_hits.bam), using the existing genome annotation recorded in FlyBase 5.36 while also allowing novel transcripts to be assembled. The output for each sample consisted of transcripts.gtf (the Cufflinks-assembled isoforms), isoforms.fpkm_tracking (isoform-level expression values in FPKM) and genes.fpkm_tracking (gene-level expression values in FPKM). The paths of the transcripts.gtf files generated for every sample were then listed in the file ‘assemblies.txt’.
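As an illustration only, a manifest such as ‘assemblies.txt’ could be written with a short Python sketch like the one below. The function name and the per-sample directory layout are hypothetical and are not taken from this work; the sketch assumes only that each Cufflinks output directory contains a transcripts.gtf file.

```python
from pathlib import Path


def write_assembly_list(sample_dirs, list_path="assemblies.txt"):
    """Write one transcripts.gtf path per line and return the list of paths.

    sample_dirs: Cufflinks output directories, one per sample
                 (hypothetical names; adapt to the real layout).
    """
    paths = [str(Path(d) / "transcripts.gtf") for d in sample_dirs]
    # One path per line, terminated by a newline.
    Path(list_path).write_text("\n".join(paths) + "\n")
    return paths
```

For example, calling write_assembly_list(["cufflinks_heads", "cufflinks_testes", "cufflinks_tubules", "cufflinks_wholefly"]) would write four lines, one per tissue sample, to assemblies.txt in the working directory.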

Cuffmerge converted the input GTF files to SAM format and then merged the Cufflinks-generated transcripts.gtf files (those listed in ‘assemblies.txt’ for each tissue) into a single merged.gtf file. The output files were transcripts.gtf, isoforms.fpkm_tracking and genes.fpkm_tracking, covering all tissue samples.

Cuffcompare compared the transcripts.gtf file against the reference genome annotation file, and the final merged.gtf file was generated (an excerpt of the merged.gtf file is shown in Table 4-6). This merged file contained the newly assigned gene_id (XLOC_...) and transcript_id (TCONS_...) identifiers, the exon start and end positions, and transcript class_codes indicating the likely type of each transcript.
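The attribute fields described above can be read programmatically. The following is a minimal Python sketch, assuming the attributes follow the key "value"; layout seen in the merged.gtf excerpt; the helper name parse_attributes is introduced here for illustration and is not part of Cufflinks.

```python
import re

# Matches pairs such as: gene_id "XLOC_000806"
ATTR_RE = re.compile(r'(\w+)\s+"([^"]*)"')


def parse_attributes(attr_field):
    """Return the GTF attribute column as a dict of key -> value strings."""
    return dict(ATTR_RE.findall(attr_field))


record = ('gene_id "XLOC_000806"; transcript_id "TCONS_00001671"; '
          'exon_number "1"; class_code "u"; tss_id "TSS984";')
attrs = parse_attributes(record)
print(attrs["gene_id"], attrs["transcript_id"], attrs["class_code"])
```

Because the pattern looks only for key and quoted-value pairs, stray or doubled semicolons between attributes do not affect the result.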

Cuffcompare was used to identify novel genes or transcripts in the merged.gtf file that were located in intergenic regions (class_code “u”) or fell entirely within intronic regions (class_code “i”). However, the novel genes discovered in this way were not necessarily tissue-specific; Cuffdiff allowed tissue-specific novel genes to be identified. An excerpt of a merged.gtf file with class_codes is shown in Table 4-6, and a summary of the class_codes across all tissues is given in Table 4-7.
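The selection of candidate novel loci by class_code can be sketched in Python as follows. This is an illustrative filter only, not the procedure used in this work: the function name novel_loci is hypothetical, and the sketch assumes each record carries gene_id, transcript_id and class_code attributes as in the merged.gtf excerpt.

```python
import re

ATTR_RE = re.compile(r'(\w+)\s+"([^"]*)"')
NOVEL_CODES = {"u", "i"}  # intergenic and intronic class codes


def novel_loci(gtf_lines, codes=NOVEL_CODES):
    """Map gene_id -> set of transcript_ids whose class_code is in codes."""
    loci = {}
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        attrs = dict(ATTR_RE.findall(line))
        if attrs.get("class_code") not in codes:
            continue
        if "gene_id" in attrs and "transcript_id" in attrs:
            loci.setdefault(attrs["gene_id"], set()).add(attrs["transcript_id"])
    return loci


excerpt = [
    'chr2L Cufflinks gene_id "XLOC_000806"; transcript_id "TCONS_00001671"; exon_number "1"; class_code "u"; tss_id "TSS984";',
    'chr2L Cufflinks gene_id "XLOC_000831"; transcript_id "TCONS_00001706"; exon_number "1"; class_code "x"; tss_id "TSS1008";',
]
print(novel_loci(excerpt))
```

A filter of this kind only marks a locus as unannotated; whether it is tissue-specific still has to be established from the expression comparison.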

Table 4-6 Example of an excerpt from a merged.gtf file with the reported class_code

Note that this is only an excerpt from the merged.gtf file (the original file is too large to include in full in the thesis). The file covers all the tissue samples. It shows the chromosome, the gene prediction source (Cufflinks or FlyBase), the gene identification number (XLOC_), the transcript identification number (TCONS_), the exon number within the transcript, and the class_code. The tss_id is the ID of the transcript's inferred start site; it determines which primary transcript the processed transcript is believed to come from.

chr2L Cufflinks gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "1"; class_code "j"; tss_id "TSS1";
chr2L Cufflinks gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "2"; class_code "j"; tss_id "TSS1";
chr2L Cufflinks gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "3"; class_code "j"; tss_id "TSS1";
chr2L FlyBase gene_id "XLOC_000001"; transcript_id "TCONS_00000003"; exon_number "1"; class_code "="; tss_id "TSS1";
chr2L FlyBase gene_id "XLOC_000001"; transcript_id "TCONS_00000003"; exon_number "2"; class_code "="; tss_id "TSS1";
chr2L Cufflinks gene_id "XLOC_000002"; transcript_id "TCONS_00000004"; exon_number "1"; class_code "s"; tss_id "TSS2";
chr2L Cufflinks gene_id "XLOC_000002"; transcript_id "TCONS_00000004"; exon_number "2"; class_code "s"; tss_id "TSS2";
chr2L Cufflinks gene_id "XLOC_000014"; transcript_id "TCONS_00000051"; exon_number "1"; class_code "x";
chr2L Cufflinks gene_id "XLOC_000015"; transcript_id "TCONS_00000052"; exon_number "1"; class_code "o";
chr2L Cufflinks gene_id "XLOC_000806"; transcript_id "TCONS_00001671"; exon_number "1"; class_code "u"; tss_id "TSS984";
chr2L Cufflinks gene_id "XLOC_000806"; transcript_id "TCONS_00001671"; exon_number "2"; class_code "u"; tss_id "TSS984";
chr2L Cufflinks gene_id "XLOC_000831"; transcript_id "TCONS_00001706"; exon_number "1"; class_code "x"; tss_id "TSS1008";
chr2L Cufflinks gene_id "XLOC_000857"; transcript_id "TCONS_00001767"; class_code "u"; tss_id "TSS1049";
chr2L Cufflinks gene_id "XLOC_000857"; transcript_id "TCONS_00001767"; exon_number "2"; class_code "u"; tss_id "TSS1049";
chr2L Cufflinks gene_id "XLOC_000857"; transcript_id "TCONS_00001767"; exon_number "3"; class_code "u"; tss_id "TSS1049";
chr2L Cufflinks gene_id "XLOC_000867"; transcript_id "TCONS_00001805"; exon_number "1"; class_code "u"; tss_id "TSS1068";
chr2L Cufflinks gene_id "XLOC_000867"; transcript_id "TCONS_00001805"; exon_number "2"; class_code "u"; tss_id "TSS1068";
chr2L Cufflinks gene_id "XLOC_000886"; transcript_id "TCONS_00001843"; exon_number "1"; class_code "x";
chr2L Cufflinks gene_id "XLOC_000887"; transcript_id "TCONS_00001844"; exon_number "1"; class_code "j"; tss_id "TSS1091";
chr2L Cufflinks gene_id "XLOC_000887"; transcript_id "TCONS_00001844"; exon_number "2"; class_code "j"; tss_id "TSS1091";
chr2L Cufflinks gene_id "XLOC_000887"; transcript_id "TCONS_00001844"; exon_number "3"; class_code "j"; tss_id "TSS1091";
chr2L Cufflinks gene_id "XLOC_000887"; transcript_id "TCONS_00001844"; exon_number "4"; class_code "j"; tss_id "TSS1091";
chr2L FlyBase gene_id "XLOC_000901"; transcript_id "TCONS_00001877"; exon_number "1"; class_code "=";
chr2L Cufflinks gene_id "XLOC_000902"; transcript_id "TCONS_00001878"; exon_number "1"; class_code "x";
chr2R Cufflinks gene_id "XLOC_003782"; transcript_id "TCONS_00007807"; exon_number "1"; class_code "x"; tss_id "TSS4243";
chr2R Cufflinks gene_id "XLOC_003782"; transcript_id "TCONS_00007807"; exon_number "2"; class_code "x"; tss_id "TSS4243";
chr2R FlyBase gene_id "XLOC_003783"; transcript_id "TCONS_00007808"; exon_number "1"; class_code "="; tss_id "TSS4244";
chr2R Cufflinks gene_id "XLOC_003784"; transcript_id "TCONS_00007810"; exon_number "1"; class_code "j"; tss_id "TSS4246";
chr2R Cufflinks gene_id "XLOC_003784"; transcript_id "TCONS_00007810"; exon_number "2"; class_code "j"; tss_id "TSS4246";
chr2R Cufflinks gene_id "XLOC_003784"; transcript_id "TCONS_00007810"; exon_number "3"; class_code "j"; tss_id "TSS4246";
chr2R Cufflinks gene_id "XLOC_003784"; transcript_id "TCONS_00007810"; exon_number "4"; class_code "j"; tss_id "TSS4246";
chr2R Cufflinks gene_id "XLOC_004549"; transcript_id "TCONS_00009527"; exon_number "1"; class_code "u"; tss_id "TSS5130";
chr2R Cufflinks gene_id "XLOC_004549"; transcript_id "TCONS_00009527"; exon_number "2"; class_code "u"; tss_id "TSS5130";
chr2R Cufflinks gene_id "XLOC_004550"; transcript_id "TCONS_00009528"; exon_number "1"; class_code "u"; tss_id "TSS5131";
chr2R Cufflinks gene_id "XLOC_004550"; transcript_id "TCONS_00009528"; exon_number "2"; class_code "u"; tss_id "TSS5131";
chr3L Cufflinks gene_id "XLOC_006483"; transcript_id "TCONS_00013701"; exon_number "1"; class_code "j"; tss_id "TSS7243";
chr3L Cufflinks gene_id "XLOC_006483"; transcript_id "TCONS_00013701"; exon_number "2"; class_code "j"; tss_id "TSS7243";
chr3L Cufflinks gene_id "XLOC_006483"; transcript_id "TCONS_00013701"; exon_number "3"; class_code "j"; tss_id "TSS7243";
chr3L Cufflinks gene_id "XLOC_007340"; transcript_id "TCONS_00015452"; exon_number "1"; class_code "u"; tss_id "TSS8255";
chr3L Cufflinks gene_id "XLOC_007340"; transcript_id "TCONS_00015452"; exon_number "2"; class_code "u"; tss_id "TSS8255";
chr3L Cufflinks gene_id "XLOC_007340"; transcript_id "TCONS_00015453"; exon_number "1"; class_code "u"; tss_id "TSS8255";
chr3L Cufflinks gene_id "XLOC_007340"; transcript_id "TCONS_00015453"; exon_number "2"; class_code "u"; tss_id "TSS8255";
chr3L Cufflinks gene_id "XLOC_007340"; transcript_id "TCONS_00015453"; exon_number "3"; class_code "u"; tss_id "TSS8255";
chr3R Cufflinks gene_id "XLOC_011674"; transcript_id "TCONS_00024666"; exon_number "1"; class_code "u"; tss_id "TSS13385";
chr3R Cufflinks gene_id "XLOC_011674"; transcript_id "TCONS_00024666"; exon_number "2"; class_code "u"; tss_id "TSS13385";
chr3R Cufflinks gene_id "XLOC_011675"; transcript_id "TCONS_00024667"; exon_number "1"; class_code "x"; tss_id "TSS13386";
chr3R Cufflinks gene_id "XLOC_011675"; transcript_id "TCONS_00024667"; exon_number "2"; class_code "x"; tss_id "TSS13386";
chr3R Cufflinks gene_id "XLOC_011676"; transcript_id "TCONS_00024668"; exon_number "1"; class_code "="; tss_id "TSS13387";
chr3R Cufflinks gene_id "XLOC_011676"; transcript_id "TCONS_00024668"; exon_number "2"; class_code "="; tss_id "TSS13387";
chr3R Cufflinks gene_id "XLOC_011676"; transcript_id "TCONS_00024668"; exon_number "3"; class_code "="; tss_id "TSS13387";

Table 4-7 Summary of classified transcripts in the merged.gtf file for all tissues

| Class_code id | Description | Total number of transcripts |
|---|---|---|
| = | Complete match of intron chain | 114761 |
| u | Unknown, intergenic transcript | 2568 |
| o | Generic exonic overlap with a reference transcript | 1114 |
| j | Potentially novel isoform (fragment): at least one splice junction is shared with a reference transcript | 76460 |
| x | Exonic overlap with reference on the opposite strand | 1071 |
| s | An intron of the transfrag overlaps a reference intron on the opposite strand (likely due to read mapping errors) | 205 |
| p | Possible polymerase run-on fragment (within 2 kb of a reference transcript) | 0 |
| r | Repeat. Currently determined by looking at the soft-masked reference sequence and applied to transcripts where at least 50% of the bases are lower case | 0 |
| e | Single exon transfrag overlapping a reference exon and at least 10 bp of a reference intron, indicating a possible pre-mRNA fragment | 0 |
| i | A transfrag falling entirely within a reference intron | 0 |
| c | Contained | 0 |
| Total | | 196179 |

Note that this is the summary of all classified transcripts in the merged file produced by Cufflinks for heads, testes, tubules and whole flies. The class_code id is defined by Cufflinks.
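A per-class tally of this kind could be reproduced with a short Python sketch such as the one below. It assumes that each transcript is counted once, however many exon records it has; whether the totals in Table 4-7 were counted per transcript or per record is not stated here, so the function (a hypothetical helper, not part of Cufflinks) illustrates the idea rather than the exact procedure used.

```python
import re
from collections import Counter

ATTR_RE = re.compile(r'(\w+)\s+"([^"]*)"')


def class_code_counts(gtf_lines):
    """Count transcripts per class_code, counting each transcript_id once."""
    code_by_transcript = {}
    for line in gtf_lines:
        attrs = dict(ATTR_RE.findall(line))
        if "transcript_id" in attrs and "class_code" in attrs:
            # Later exon records of the same transcript overwrite the same key.
            code_by_transcript[attrs["transcript_id"]] = attrs["class_code"]
    return Counter(code_by_transcript.values())


excerpt = [
    'chr2L Cufflinks gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "1"; class_code "j"; tss_id "TSS1";',
    'chr2L Cufflinks gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "2"; class_code "j"; tss_id "TSS1";',
    'chr2L FlyBase gene_id "XLOC_000001"; transcript_id "TCONS_00000003"; exon_number "1"; class_code "="; tss_id "TSS1";',
    'chr2L Cufflinks gene_id "XLOC_000002"; transcript_id "TCONS_00000004"; exon_number "1"; class_code "s"; tss_id "TSS2";',
]
counts = class_code_counts(excerpt)
print(counts)
print(sum(counts.values()))
```

Summing the counts gives the number of distinct transcripts, which corresponds to the "Total" row under the per-transcript assumption.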