Gene Expression Analysis Annotation/Functional Analysis

(1)

(2)

Annotation/Functional Analysis

DE analysis at gene level (Trinity)

• Filtered genes from DESeq2 -> filtered isoforms

• In Excel

• Save list of names of "genes", e.g. TRINITY_DN7498_c0_g1

• Divide genes into groups based on expression pattern

• Up, Equal, Down by LFC cutoff and P-value

• Two comparisons between time points 1, 2, 3

UU, UE, UD

EU, EE, ED

DU, DE, DD

• Save lists of genes to separate text files

• Extract all isoforms corresponding to the filtered genes from the transcriptome FASTA file

fasta_select.py list_file fasta_file > isoform.fa

• Most annotation analyses need peptide sequences

• Use TransDecoder to get probable protein sequences
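The script fasta_select.py is not shown on the slide. The following is only a minimal sketch of what such a selection step could look like, assuming the list file holds one Trinity gene name per line and that an isoform ID is the gene name plus an `_i<number>` suffix. The function names (`read_fasta`, `gene_of`, `select_isoforms`) and the demo sequences are hypothetical.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs; the header excludes the leading '>'."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None and line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def gene_of(transcript_id):
    """TRINITY_DN7498_c0_g1_i2 -> TRINITY_DN7498_c0_g1 (drop the _i suffix)."""
    return transcript_id.rsplit("_i", 1)[0]


def select_isoforms(gene_names, fasta_handle):
    """Yield every FASTA record whose gene name is in gene_names."""
    wanted = {name.strip() for name in gene_names if name.strip()}
    for header, seq in read_fasta(fasta_handle):
        words = header.split()
        if words and gene_of(words[0]) in wanted:
            yield header, seq


# Tiny in-memory demonstration with made-up sequences.
demo_fasta = io.StringIO(
    ">TRINITY_DN7498_c0_g1_i1 len=8\nACGTACGT\n"
    ">TRINITY_DN7498_c0_g1_i2 len=4\nACGT\n"
    ">TRINITY_DN7498_c0_g2_i1 len=4\nTTTT\n"
)
for header, seq in select_isoforms(["TRINITY_DN7498_c0_g1"], demo_fasta):
    print(f">{header}\n{seq}")
```

With real files, the gene list and the FASTA file would be opened with `open()` and passed in the same way.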

Computational Genomics 2020

Week 9

# 2

Pie chart of expression-pattern groups: DD 2%, DE 40%, ED 2%, DU 8%, UD 4%, EU 3%, UE 38%, UU 3%
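The Up/Equal/Down grouping was done in Excel. As an illustration only, the same rule can be sketched in Python. The cutoffs (|LFC| >= 1, adjusted P < 0.05), the use of padj rather than the raw P-value, and all numbers in the demo are assumptions, since the slide does not give the values used.

```python
import math


def call(lfc, padj, lfc_cutoff=1.0, alpha=0.05):
    """Return 'U' (up), 'D' (down) or 'E' (equal) for one comparison."""
    # DESeq2 reports NA for padj on some genes; treat those as not significant.
    if padj is None or math.isnan(padj) or padj >= alpha:
        return "E"
    if lfc >= lfc_cutoff:
        return "U"
    if lfc <= -lfc_cutoff:
        return "D"
    return "E"


def pattern(first, second, lfc_cutoff=1.0, alpha=0.05):
    """Two-letter pattern from two (lfc, padj) pairs, e.g. 'UE'."""
    return call(*first, lfc_cutoff, alpha) + call(*second, lfc_cutoff, alpha)


def group_genes(results, lfc_cutoff=1.0, alpha=0.05):
    """Map each pattern (UU ... DD) to the list of genes that show it."""
    groups = {}
    for gene, (first, second) in results.items():
        groups.setdefault(pattern(first, second, lfc_cutoff, alpha), []).append(gene)
    return groups


# Hypothetical results: gene -> ((lfc, padj) for comparison 1, same for comparison 2)
demo = {
    "TRINITY_DN7498_c0_g1": ((2.3, 0.001), (-0.2, 0.80)),
    "TRINITY_DN1_c0_g1": ((-1.5, 0.010), (1.2, 0.04)),
    "TRINITY_DN2_c0_g1": ((3.0, float("nan")), (0.1, 0.50)),
}
for name, genes in sorted(group_genes(demo).items()):
    print(name, genes)
```

Each list in the returned dictionary can then be written to its own text file, one gene name per line.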

(3)

Annotation/Functional Analysis

Observed transcripts – genes to transcripts

• Stringtie

• use gffread to write out all transcript sequences from merged GTF file

• copy transcript file and remove isoform numbers

• I used vi with the command

:%s/\.\d* CDS/ CDS/

which replaces a period followed by any number of digits, a space, and "CDS" with a space and "CDS"

• work on a copy; after editing, this file will have duplicate names

• write out list of selected genes from DESeq2 analysis

• all genes without padj or LFC selection

• one gene per line (if you used the gene count file there will be no transcript numbers)

• use seqtk subseq to select the sequences

seqtk subseq merged_transcript.fa stringtie_filtered.list > stringtie_filtered.fa

• count number of genes with grep and wc

• 20610 selected genes -> 36310 transcripts

>MSTRG.1.2 CDS=1-869
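For anyone who prefers a script to vi, the same substitution can be sketched in Python with the same regular expression. This assumes headers of the form shown above (`>MSTRG.1.2 CDS=1-869`); the helper names are hypothetical.

```python
import re

# Same pattern as the vi command: a period, any digits, a space, then "CDS".
ISOFORM_SUFFIX = re.compile(r"\.\d* CDS")


def strip_isoform(header):
    """>MSTRG.1.2 CDS=1-869 -> >MSTRG.1 CDS=1-869"""
    return ISOFORM_SUFFIX.sub(" CDS", header, count=1)


def count_genes_and_transcripts(header_lines):
    """Return (number of distinct gene names, number of transcripts)."""
    names = [strip_isoform(h)[1:].split()[0] for h in header_lines if h.startswith(">")]
    return len(set(names)), len(names)


demo_headers = [
    ">MSTRG.1.1 CDS=1-10",
    ">MSTRG.1.2 CDS=1-869",
    ">MSTRG.2.1 CDS=3-30",
]
print([strip_isoform(h) for h in demo_headers])
print(count_genes_and_transcripts(demo_headers))
```

Headers without a " CDS" field are left unchanged by this pattern, exactly as with the vi command.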

(4)

Transcriptome Assembly

Trinity output

• Trinity.fasta.gene_trans_map: maps transcripts to genes

• recursive_trinity.cmds: butterfly commands

• recursive_trinity.cmds.completed: successful commands

• recursive_trinity.cmds.failed: rerun these
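As a small illustration of how the gene-to-transcript map can be used, the sketch below counts isoforms per gene. It assumes the usual two-column, tab-separated layout (gene ID first, transcript ID second); the function name is hypothetical.

```python
from collections import Counter


def isoforms_per_gene(lines):
    """Count transcripts per gene from gene_trans_map lines (gene<TAB>transcript)."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) >= 2:
            counts[fields[0]] += 1
    return counts


demo_map = [
    "TRINITY_DN8_c0_g1\tTRINITY_DN8_c0_g1_i1\n",
    "TRINITY_DN8_c0_g1\tTRINITY_DN8_c0_g1_i3\n",
    "TRINITY_DN8_c0_g2\tTRINITY_DN8_c0_g2_i4\n",
]
for gene, n in isoforms_per_gene(demo_map).most_common():
    print(gene, n)
```

For the real file, pass an open file handle instead of the demo list.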

Computational Genomics 2020

Week 7

# 4

(5)

Transcriptome Assembly

Trinity output

• Trinity produces many files

• Trinity files take a lot of space

• You MUST compress

• I suggest backing up the entire result with tar, then

(6)

Transcriptome Assembly

Trinity output

• cleanup

• tar with parallel compression

• takes about 77 min

• final size of 200421_avocado_trinity.tar.tgz = 105 G


(7)

Transcriptome Assembly

Trinity genome guided assembly

• Use reads mapped against reference genome

• I used the reads mapped with HISAT for the StringTie analysis

• merge into single bam file with samtools merge

• recommended only if genome is fairly complete

• with defaults, produced 234415 predicted transcripts (compared to 265348 for de novo)

(8)

Transcriptome Assembly

Trinity results

• By default, results are written to the file trinity_out_dir/Trinity.fasta

• Change this name to a more informative one immediately

• avocado_trinity_200422.fa

• how many predicted transcripts

grep '>' avocado_trinity_200422.fa | wc

265348 1473154 20582035

• In the predicted transcripts file each sequence is on a single line; this may not work for all downstream programs


>TRINITY_DN8_c0_g1_i3 len=630 path=[0:0-251 2:252-503 3:504-504 5:505-629]

ATCAAATCTTACGAGGTGTGAAAGTCAGGTTCCATGACAAAGAGGGAAAAGGCTGCAGACAGAAAGAAACAGCCACTCTGCAACTCAGTATTAATAGGAAAATG

TCTTTAATGGAGAAAAGCTCTCATCCAAGGGCGGGAATAATAGCAGGCCTGTTTTGAAAATGATTTCATTTTCATCATCTACCTGTGCGTTCTTAAATAAGTTG

GAGCTGAAAACTAAGCACTCTTTGAAACTCTCTCTCTATATCAGGTAGTAAGAAAAAGAGAAGAAAGAAGAAAGAAATGCCCGTCTTTGATCTTATATCTGCTA

TCCAACTGCATTTCTACTGACATTCTATAGATTTTTATGCCATTCACTACTTCTTGTCATCTTTTTACCTGGGTTTGTTTGGACTATGGATTTCTGGTTTATAT

ATGAATCAATAAATTATGGATATAAACCTACAAGTTTTTCCCCTTTTTCTTGACTGGGAGTTCGAAAATCAGTGTTTTTGCTTTGATGGGTACTTCTGTTCTCT

GTTCCACCCTCTCTCGATCTTTTCTGTTGGTTTTTCTCTGATGGGTTATTCCTGTATTCAGCTGAGGAGTTGATGTCAGTGAATTTTTTTTTTTTTTTTTCAGT

TTTCTG

>TRINITY_DN8_c0_g1_i1 len=3405 path=[0:0-251 2:252-503 3:504-504 4:505-3404]
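If a downstream program cannot handle single-line sequences, the file can be re-wrapped. The sketch below is one possible way to do that and to count records without grep. It assumes each sequence sits on one line, as described above, and the 60-column width is an arbitrary choice.

```python
def wrap_fasta(lines, width=60):
    """Yield FASTA lines with each sequence line cut into chunks of `width`."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            yield line                      # header lines pass through unchanged
        else:
            for i in range(0, len(line), width):
                yield line[i:i + width]


def count_records(lines):
    """Number of sequences, i.e. lines starting with '>' (like grep '>' | wc -l)."""
    return sum(1 for line in lines if line.startswith(">"))


demo = [">seq1 len=10", "ACGTACGTAC", ">seq2 len=3", "TTT"]
print("\n".join(wrap_fasta(demo, width=4)))
print(count_records(demo))
```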

(9)

Transcriptome Assembly

Trinity predicted transcript IDs

• _DN read cluster (Inchworm): reads that contain overlapping k-mers

• _c component (Chrysalis): clusters that have read support

• _g gene (Butterfly): alternative de Bruijn graph traces

• _i isoform

>TRINITY_DN8_c0_g1_i3 len=630 path=[0:0-251 2:252-503 3:504-504 5:505-629]

>TRINITY_DN8_c0_g1_i1 len=3405 path=[0:0-251 2:252-503 3:504-504 4:505-3404]

>TRINITY_DN8_c0_g1_i2 len=3207 path=[1:0-305 3:306-306 4:307-3206]

>TRINITY_DN8_c0_g1_i4 len=3152 path=[0:0-251 4:252-3151]

>TRINITY_DN8_c0_g2_i4 len=3043 path=[2:0-675 4:676-819 5:820-823 6:824-2349 8:2350-3042]

>TRINITY_DN8_c0_g2_i3 len=788 path=[1:0-94 8:95-787]

>TRINITY_DN8_c0_g2_i6 len=338 path=[0:0-121 4:122-265 5:266-269 7:270-337]

>TRINITY_DN8_c0_g2_i2 len=2489 path=[0:0-121 4:122-265 5:266-269 6:270-1795 8:1796-2488]

>TRINITY_DN8_c0_g2_i5 len=2343 path=[3:0-119 5:120-123 6:124-1649 8:1650-2342]
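The ID structure above can be taken apart with a regular expression. This is a minimal sketch that assumes IDs always follow the `TRINITY_DN<n>_c<n>_g<n>_i<n>` pattern seen in these headers; the function and field names are hypothetical.

```python
import re
from collections import defaultdict

TRINITY_ID = re.compile(
    r"^>?(?P<gene>TRINITY_DN(?P<cluster>\d+)_c(?P<component>\d+)_g(?P<g>\d+))"
    r"_i(?P<isoform>\d+)"
)


def parse_trinity_id(header):
    """Split a Trinity header or ID into cluster, component, gene and isoform."""
    m = TRINITY_ID.match(header)
    if m is None:
        raise ValueError(f"not a Trinity transcript ID: {header!r}")
    return {
        "gene": m["gene"],
        "cluster": int(m["cluster"]),
        "component": int(m["component"]),
        "gene_number": int(m["g"]),
        "isoform": int(m["isoform"]),
    }


def isoforms_by_gene(headers):
    """Group isoform numbers under their gene name."""
    groups = defaultdict(list)
    for header in headers:
        parsed = parse_trinity_id(header)
        groups[parsed["gene"]].append(parsed["isoform"])
    return dict(groups)


demo = [
    ">TRINITY_DN8_c0_g1_i3 len=630 path=[0:0-251 2:252-503 3:504-504 5:505-629]",
    ">TRINITY_DN8_c0_g1_i1 len=3405 path=[0:0-251 2:252-503 3:504-504 4:505-3404]",
    ">TRINITY_DN8_c0_g2_i4 len=3043 path=[2:0-675 4:676-819 5:820-823 6:824-2349 8:2350-3042]",
]
print(isoforms_by_gene(demo))
```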

(10)

Annotation/Functional Analysis

Some selected genes

• Many genes have multiple isoforms; which are interesting?


(11)

Annotation/Functional Analysis

TransDecoder (trinity)

• Try to identify "best" proteins

• For stranded data run in -S mode

• TransDecoder.LongOrfs

Find the longest ORFs

• minimum ORF length is 100 amino acids, change with -m

• --gene_trans_map

• ORFs must start with M

not good for fragments or alternative start codons

• TransDecoder.Predict

Evaluate ORFs

• 5th-order (hexamer) Markov model based on the longest ORFs in the set of predicted transcripts

• the hexamer model includes

amino acid frequencies in proteins

amino acid pair frequencies in proteins

codon usage in the organism of interest
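To make the ORF-finding step concrete, here is a deliberately simplified six-frame scan. It is not TransDecoder's implementation: it only reports complete ORFs (ATG to stop), ignores partial ORFs and ambiguous bases, and does no Markov-model scoring. Coordinates are 1-based on the input transcript and include the stop codon; the reported length counts amino acids without the stop. All names are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def scan_strand(seq, min_codons):
    """Yield (start, end) 0-based half-open ORFs (ATG..stop) in the 3 forward frames.

    For each stop codon only the first upstream ATG in frame is used,
    which gives the longest ORF ending at that stop.
    """
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None:
                if codon == "ATG":
                    start = i
            elif codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    yield start, i + 3
                start = None


def find_orfs(seq, min_codons=100):
    """Return (strand, begin, end, aa_length) tuples, longest first."""
    seq = seq.upper()
    n = len(seq)
    orfs = []
    for start, end in scan_strand(seq, min_codons):
        orfs.append(("+", start + 1, end, (end - start) // 3 - 1))
    for start, end in scan_strand(reverse_complement(seq), min_codons):
        # Convert reverse-strand positions back to forward-strand numbering.
        orfs.append(("-", n - start, n - end + 1, (end - start) // 3 - 1))
    return sorted(orfs, key=lambda orf: orf[3], reverse=True)


toy = "CC" + "ATG" + "AAA" * 3 + "TAA" + "G"   # made-up sequence
print(find_orfs(toy, min_codons=2))
```

As in the ORF listings on the following slides, a reverse-strand ORF is reported with begin greater than end.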

(12)

TransDecoder.Predict

• longest_orfs.cds.scores

-log likelihood


(13)

Annotation/Functional Analysis

TransDecoder.LongOrfs predicted coding regions

>DN8_c0_g1 len=630 path=[0:0-251 2:252-503 3:504-504 5:505-629]

>DN8_c0_g1 len=3405 path=[0:0-251 2:252-503 3:504-504 4:505-3404]

>DN8_c0_g1 len=3207 path=[1:0-305 3:306-306 4:307-3206]

(14)

Annotation/Functional Analysis

TransDecoder.LongOrfs predicted coding regions

• Are g1 and g2 really different genes?

• Is the longest predicted transcript the best?

• Is the longest predicted ORF the best?


>DN8_c0_g2.p1 type:complete len:676 gc:universal DN8_c0_g2:719-2746(+)

>DN8_c0_g2.p2 type:complete len:98 gc:universal DN8_c0_g2:1297-1004(-)

>DN8_c0_g2.p3 type:complete len:95 gc:universal DN8_c0_g2:872-588(-)

>DN8_c0_g2.p4 type:complete len:74 gc:universal DN8_c0_g2:1173-1394(+)

>DN8_c0_g2.p5 type:complete len:67 gc:universal DN8_c0_g2:1914-1714(-)

>DN8_c0_g2.p6 type:complete len:66 gc:universal DN8_c0_g2:2774-2971(+)

>DN8_c0_g2.p7 type:complete len:61 gc:universal DN8_c0_g2:2535-2717(+)

>DN8_c0_g2.p8 type:complete len:53 gc:universal DN8_c0_g2:1452-1294(-)

>DN8_c0_g2.p9 type:complete len:96 gc:universal DN8_c0_g2:204-491(+)

>DN8_c0_g2.p10 type:complete len:66 gc:universal DN8_c0_g2:519-716(+)

>DN8_c0_g2.p11 type:complete len:61 gc:universal DN8_c0_g2:280-462(+)

>DN8_c0_g2.p12 type:complete len:60 gc:universal DN8_c0_g2:71-250(+)

>DN8_c0_g2.p13 type:complete len:676 gc:universal DN8_c0_g2:165-2192(+)

>DN8_c0_g2.p14 type:complete len:98 gc:universal DN8_c0_g2:743-450(-)

>DN8_c0_g2.p15 type:complete len:74 gc:universal DN8_c0_g2:619-840(+)

>DN8_c0_g2.p16 type:complete len:67 gc:universal DN8_c0_g2:1360-1160(-)

>DN8_c0_g2.p17 type:complete len:66 gc:universal DN8_c0_g2:2220-2417(+)

>DN8_c0_g2.p18 type:complete len:66 gc:universal DN8_c0_g2:318-121(-)

>DN8_c0_g2.p19 type:complete len:61 gc:universal DN8_c0_g2:1981-2163(+)

>DN8_c0_g2.p20 type:complete len:53 gc:universal DN8_c0_g2:898-740(-)

>DN8_c0_g2.p21 type:complete len:644 gc:universal DN8_c0_g2:115-2046(+)

>DN8_c0_g2.p22 type:complete len:98 gc:universal DN8_c0_g2:597-304(-)

>DN8_c0_g2.p23 type:complete len:74 gc:universal DN8_c0_g2:473-694(+)

>DN8_c0_g2.p24 type:complete len:67 gc:universal DN8_c0_g2:1214-1014(-)

>DN8_c0_g2.p25 type:complete len:66 gc:universal DN8_c0_g2:2074-2271(+)

>DN8_c0_g2.p26 type:complete len:61 gc:universal DN8_c0_g2:1835-2017(+)

>DN8_c0_g2.p27 type:complete len:53 gc:universal DN8_c0_g2:752-594(-)

>DN8_c0_g2 len=3043 path=[2:0-675 4:676-819 5:820-823 6:824-2349 8:2350-3042]

>DN8_c0_g2 len=788 path=[1:0-94 8:95-787]

>DN8_c0_g2 len=338 path=[0:0-121 4:122-265 5:266-269 7:270-337]

(15)

Annotation/Functional Analysis

TransDecoder.LongOrfs predicted coding regions

• Is DN78 a coding gene?

• Which is the best ORF for DN18?

>DN78_c0_g1 len=891 path=[0:0-890]
>DN18_c0_g1 len=1354 path=[2:0-119 3:120-152 4:153-178 5:179-230 7:231-265 9:266-284 11:285-374 12:375-424 14:425-578 15:579-616 17:617-689 18:690-869 19:870-1027 22:1028-1353]
>DN18_c0_g1 len=1791 path=[10:0-721 11:722-811 12:812-861 14:862-1015 15:1016-1053 17:1054-1126 18:1127-1306 19:1307-1464 22:1465-1790]
>DN18_c0_g1 len=1384 path=[0:0-257 4:258-283 6:284-335 7:336-370 8:371-479 12:480-529 13:530-683 15:684-721 16:722-794 18:795-974 20:975-1383]
>DN18_c0_g1 len=399 path=[18:0-179 19:180-337 21:338-398]
>DN18_c0_g1 len=1252 path=[0:0-257 4:258-283 5:284-335 7:336-370 9:371-389 10:390-1111 11:1112-1201 12:1202-1251]
>DN18_c0_g1 len=1453 path=[1:0-218 3:219-251 4:252-277 5:278-329 7:330-364 9:365-383 11:384-473 12:474-523 14:524-677 15:678-715 17:716-788 18:789-968 19:969-1126 22:1127-1452]

>DN78_c0_g1.p1 type:5prime_partial len:193 gc:universal DN78_c0_g1:2-580(+)

>DN78_c0_g1.p2 type:complete len:136 gc:universal DN78_c0_g1:547-140(-)

>DN78_c0_g1.p3 type:3prime_partial len:90 gc:universal DN78_c0_g1:267-1(-)

>DN78_c0_g1.p4 type:5prime_partial len:67 gc:universal DN78_c0_g1:889-689(-)

>DN78_c0_g1.p5 type:5prime_partial len:60 gc:universal DN78_c0_g1:1-180(+)

>DN18_c0_g1.p1 type:complete len:327 gc:universal DN18_c0_g1:130-1110(+)

>DN18_c0_g1.p2 type:complete len:109 gc:universal DN18_c0_g1:749-1075(+)

>DN18_c0_g1.p3 type:complete len:92 gc:universal DN18_c0_g1:395-120(-)

>DN18_c0_g1.p4 type:3prime_partial len:75 gc:universal DN18_c0_g1:1132-1353(+)

>DN18_c0_g1.p5 type:complete len:59 gc:universal DN18_c0_g1:1166-1342(+)

>DN18_c0_g1.p6 type:complete len:302 gc:universal DN18_c0_g1:642-1547(+)

>DN18_c0_g1.p7 type:complete len:109 gc:universal DN18_c0_g1:1186-1512(+)

>DN18_c0_g1.p8 type:complete len:81 gc:universal DN18_c0_g1:832-590(-)

>DN18_c0_g1.p9 type:3prime_partial len:75 gc:universal DN18_c0_g1:1569-1790(+)

>DN18_c0_g1.p10 type:complete len:59 gc:universal DN18_c0_g1:1603-1779(+)

>DN18_c0_g1.p11 type:complete len:327 gc:universal DN18_c0_g1:235-1215(+)

>DN18_c0_g1.p12 type:complete len:198 gc:universal DN18_c0_g1:1185-592(-)

>DN18_c0_g1.p13 type:complete len:68 gc:universal DN18_c0_g1:1118-915(-)

>DN18_c0_g1.p14 type:complete len:66 gc:universal DN18_c0_g1:500-303(-)

>DN18_c0_g1.p15 type:complete len:58 gc:universal DN18_c0_g1:854-1027(+)

>DN18_c0_g1.p16 type:5prime_partial len:126 gc:universal DN18_c0_g1:1-378(+)

>DN18_c0_g1.p17 type:complete len:109 gc:universal DN18_c0_g1:59-385(+)

>DN18_c0_g1.p18 type:complete len:81 gc:universal DN18_c0_g1:1222-980(-)

>DN18_c0_g1.p19 type:3prime_partial len:74 gc:universal DN18_c0_g1:1032-1250(+)

>DN18_c0_g1.p20 type:complete len:64 gc:universal DN18_c0_g1:235-426(+)

(16)

Annotation/Functional Analysis

TransDecoder.LongOrfs predicted coding regions

• Longest transcript

• Longest ORF


>DN53_c0_g1 len=3139 path=[0:0-2584 2:2585-3138]

>DN53_c0_g1 len=3254 path=[0:0-2584 1:2585-2699 2:2700-3253]

>DN53_c0_g3 len=2958 path=[0:0-555 1:556-2736 4:2737-2957]

>DN53_c0_g3 len=3068 path=[0:0-555 1:556-2736 3:2737-2846 4:2847-3067]

>DN53_c0_g3 len=698 path=[0:0-555 2:556-697]

(17)

Annotation/Functional Analysis

TransDecoder.LongOrfs predicted coding regions

• Questions

• Are transcripts with different _g numbers really different genes?

• Is the longest predicted transcript the best?

• Is the longest predicted ORF the best?

• How good is TransDecoder at predicting the CDS?

• Method

• compare to protein library

• blastp of predicted protein

• blastx of predicted transcript

• use DIAMOND, about 1000 times faster than BLAST

• use the UniRef50 condensed database (clustered at 50% identity)

• the best candidate should have the longest match to a known protein
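The "longest match" criterion can be applied to DIAMOND tabular output with a few lines of Python. This is a sketch under stated assumptions: the column list below is a hypothetical choice of `--outfmt 6` keywords (all of which appear in DIAMOND's help), not necessarily the one used for these slides, and ties on alignment length are broken by bit score.

```python
# Assumed column order requested with --outfmt 6 (an illustrative choice).
COLUMNS = ("qseqid", "qlen", "qstart", "qend", "sseqid", "slen",
           "sstart", "send", "length", "evalue", "bitscore", "stitle")


def parse_hits(lines, columns=COLUMNS):
    """Yield one dict per tab-separated hit line."""
    for line in lines:
        line = line.rstrip("\n")
        if line and not line.startswith("#"):
            yield dict(zip(columns, line.split("\t")))


def best_hit_per_query(lines, columns=COLUMNS):
    """Keep, for each query, the hit with the longest alignment."""
    best = {}
    for hit in parse_hits(lines, columns):
        key = (int(hit["length"]), float(hit["bitscore"]))
        query = hit["qseqid"]
        if query not in best or key > best[query][0]:
            best[query] = (key, hit)
    return {query: hit for query, (_, hit) in best.items()}


# Three hits for one predicted protein, values taken from the search results
# shown later in these slides (titles shortened).
demo = ["\t".join(fields) for fields in [
    ("DN8_c0_g1.p5", "675", "1", "674", "A0A1U7Z5Q7", "679", "1", "679",
     "700", "2.0e-129", "470.3", "filament-like plant protein isoform X1"),
    ("DN8_c0_g1.p5", "675", "1", "407", "A0A5B6ZQX7", "449", "1", "404",
     "409", "4.2e-127", "462.6", "Putative filament-like plant protein (Fragment)"),
    ("DN8_c0_g1.p5", "675", "1", "666", "F6H4F3", "672", "1", "666",
     "674", "4.1e-122", "446.0", "Uncharacterized protein"),
]]
for query, hit in best_hit_per_query(demo).items():
    print(query, hit["sseqid"], hit["length"], hit["evalue"])
```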

(18)

Annotation/Functional Analysis

Transdecoder predicted coding regions

• diamond blastx

• make sure to set --threads, otherwise all available CPU cores will be used


diamond v0.9.14.115 | by Benjamin Buchfink

Licensed under the GNU AGPL <https://www.gnu.org/licenses/agpl.txt>

Check http://github.com/bbuchfink/diamond for updates.

Syntax: diamond COMMAND [OPTIONS]

Commands:
makedb   Build DIAMOND database from a FASTA file
blastp   Align amino acid query sequences against a protein reference database
blastx   Align DNA query sequences against a protein reference database
view     View DIAMOND alignment archive (DAA) formatted file
help     Produce help message
version  Display version information
getseq   Retrieve sequences from a DIAMOND database file
dbinfo   Print information about a DIAMOND database file

General options:
--threads (-p)  number of CPU threads
--db (-d)       database file

(19)

Annotation/Functional Analysis

Transdecoder predicted coding regions

--outfmt (-f) output format

0 = BLAST pairwise

5 = BLAST XML

6 = BLAST tabular, Value 6 may be followed by a space-separated list of these keywords:

qseqid means Query Seq - id

qlen means Query sequence length

sseqid means Subject Seq - id

sallseqid means All subject Seq - id(s), separated by a ';'

slen means Subject sequence length

qstart means Start of alignment in query

qend means End of alignment in query

sstart means Start of alignment in subject

send means End of alignment in subject

qseq means Aligned part of query sequence

sseq means Aligned part of subject sequence

evalue means Expect value

bitscore means Bit score

score means Raw score

length means Alignment length

pident means Percentage of identical matches

nident means Number of identical matches

mismatch means Number of mismatches

positive means Number of positive - scoring matches

gapopen means Number of gap openings

gaps means Total number of gaps

ppos means Percentage of positive - scoring matches

qframe means Query frame

btop means Blast traceback operations(BTOP)

staxids means unique Subject Taxonomy ID(s), separated by a ';' (in numerical order)

stitle means Subject Title

salltitles means All Subject Title(s), separated by a '<>'

qcovhsp means Query Coverage Per HSP

qtitle means Query title

(20)

Annotation/Functional Analysis

Transdecoder predicted coding regions

• Diamond/blastx search vs uniref50


DN8_c0_g1.p5 675 1-674 A0A1U7Z5Q7 679 1-679 700 2.0e-129 470.3 A0A1U7Z5Q7 filament-like plant protein isoform X1 n=12 Tax=Magnoliopsida TaxID=3398 RepID=A0A1U7Z5Q7_NELNU
DN8_c0_g1.p5 675 1-407 A0A5B6ZQX7 449 1-404 409 4.2e-127 462.6 A0A5B6ZQX7 Putative filament-like plant protein (Fragment) n=2 Tax=Pentapetalae TaxID=1437201
DN8_c0_g1.p5 675 1-666 F6H4F3 672 1-666 674 4.1e-122 446.0 F6H4F3 Uncharacterized protein n=87 Tax=Mesangiospermae TaxID=1437183 RepID=F6H4F3_VITVI
DN8_c0_g1.p16 675 1-666 F6H4F3 672 1-666 674 4.1e-122 446.0 F6H4F3 Uncharacterized protein n=87 Tax=Mesangiospermae TaxID=1437183 RepID=F6H4F3_VITVI
DN8_c0_g1.p27 675 1-666 F6H4F3 672 1-666 674 4.1e-122 446.0 F6H4F3 Uncharacterized protein n=87 Tax=Mesangiospermae TaxID=1437183 RepID=F6H4F3_VITVI
DN8_c0_g2.p1 676 1-407 A0A5B6ZQX7 449 1-406 411 9.1e-130 471.5 A0A5B6ZQX7 Putative filament-like plant protein (Fragment) n=2 Tax=Pentapetalae TaxID=1437201
DN8_c0_g2.p1 676 1-668 A0A1U7Z5Q7 679 1-673 679 6.1e-126 458.8 A0A1U7Z5Q7 filament-like plant protein isoform X1 n=12 Tax=Magnoliopsida TaxID=3398 RepID=A0A1U7Z5Q7_NELNU
DN8_c0_g2.p1 676 1-584 A0A4S4DQK1 735 1-645 650 2.8e-123 449.9 A0A4S4DQK1 Uncharacterized protein n=33 Tax=Mesangiospermae TaxID=1437183 RepID=A0A4S4DQK1_CAMSI
DN8_c0_g2.p13 676 1-407 A0A5B6ZQX7 449 1-406 411 9.1e-130 471.5 A0A5B6ZQX7 Putative filament-like plant protein (Fragment) n=2 Tax=Pentapetalae TaxID=1437201

(21)

Annotation/Functional Analysis

               10        20        30        40        50        60        70        80
DN8_c0 MENRSWLWRKKSSEKSPGETESSGSVSSHSERFSDDQEASRGPPNHSQSPEISSNLAGSKVQDTVKSLTERLSAALSNIS
       :: ::::::.::::::::::::::::::   ::::::::::. :::..:::.:::::::..:::::::::.:::::::::
DN8_c0 MERRSWLWRRKSSEKSPGETESSGSVSSG--RFSDDQEASRASPNHTRSPEVSSNLAGSEAQDTVKSLTEKLSAALSNIS

(22)

Annotation/Functional Analysis

Transdecoder predicted coding regions

• DN8_c0_g1 vs DN8_c0_g2

• match to same set of proteins

• similar but not identical assembly block structure

• the differences are not the block-wise kind expected from mis-assembly

• align well at RNA and protein level, with consistent small differences

• about 16% different at the amino acid level

• more likely to be duplicated gene than alleles
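As a rough check of the "about 16%" figure, the differing columns can be counted in the 80-column alignment excerpt from the previous slide. This sketch uses only that excerpt, not the full-length proteins, and counts gap columns as differences; the variable and function names are hypothetical.

```python
# The two aligned rows of the 80-column excerpt ('-' marks a gap).
SEQ_TOP = ("MENRSWLWRKKSSEKSPGETESSGSVSSHSERFSDDQEAS"
           "RGPPNHSQSPEISSNLAGSKVQDTVKSLTERLSAALSNIS")
SEQ_BOTTOM = ("MERRSWLWRRKSSEKSPGETESSGSVSSG--RFSDDQEAS"
              "RASPNHTRSPEVSSNLAGSEAQDTVKSLTEKLSAALSNIS")


def percent_different(a, b):
    """Return (number of differing columns, percent of columns that differ)."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    differing = sum(1 for x, y in zip(a, b) if x != y)
    return differing, 100.0 * differing / len(a)


count, percent = percent_different(SEQ_TOP, SEQ_BOTTOM)
print(f"{count} of {len(SEQ_TOP)} columns differ ({percent:.2f}%)")
```

For this excerpt that is 13 of 80 columns, or 16.25%, in line with the figure quoted above.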

(23)

Annotation/Functional Analysis

Transdecoder predicted coding regions

query | length | source | transcript length | begin | end | subject | title | length | begin | end | align length | E | score | taxonomy
DN78_c0_g1.p1 | 193 | 2-580 (+) | 891 | 42 | 192 | A0A3S3R0T1 | Lipoyl synthase | 151 | 1 | 151 | 151 | 1.10E-77 | 123.6 | Cinnamomum micranthum
DN78_c0_g1.p2 | 136 | 547-140 (-) | no
DN78_c0_g1.p3 | 90 | 267-1 (-) | no
DN78_c0_g1.p4 | 67 | 889-689 (-) | no
DN78_c0_g1.p5 | 60 | 1-180 (+) | no
DN18_c0_g1.p1 | 327 | 130-1110 (+) | 1345 | 9 | 326 | A0A498HHW2 | Cysteine synthase | 1103 | 786 | 1103 | 318 | 1.50E-162 | 579.3 | Magnoliopsida
DN18_c0_g1.p2 | 109 | 749-1075 (+) | no
DN18_c0_g1.p3 | 92 | 395-120 (-) | | 1 | 74 | A0A0A9NZJ6 | Uncharacterized | 98 | 25 | 98 | 74 | 2.00E-11 | 75.5 | Arundo donax
DN18_c0_g1.p4 | 75 | 1132-1353 (+) | no
DN18_c0_g1.p5 | 59 | 1166-1342 (+) | no
DN18_c0_g1.p6 | 302 | 642-1547 (+) | 1791 | 27 | 301 | A0A498HHW2 | Cysteine synthase | 1103 | 829 | 1103 | 275 | 3.30E-140 | 505 | Magnoliopsida
DN18_c0_g1.p7 | 109 | 1186-1512 (+) | no
DN18_c0_g1.p8 | 81 | 832-590 (-) | no
DN18_c0_g1.p9 | 75 | 1569-1790 (+) | no
DN18_c0_g1.p10 | 59 | 1603-1779 (+) | no
DN18_c0_g1.p11 | 327 | 235-1215 (+) | 399 | 9 | 323 | F6HTU8 | Cysteine synthase | 701 | 89 | 403 | 315 | 4.10E-160 | 571.2 | Mesangiospermae
DN18_c0_g1.p12 | 198 | 1185-592 (-) | no
DN18_c0_g1.p13 | 68 | 1118-915 (-) | | 1 | 67 | I3STT7 | Uncharacterized | 82 | 16 | 82 | 67 | 3.50E-13 | 80.9 | Lotus japonicus
DN18_c0_g1.p14 | 66 | 500-303 (-) | | 3 | 63 | A0A448Z883 | Uncharacterized | 354 | 235 | 295 | 61 | 1.20E-07 | 62.4 | Pseudo-nitzschia multistriata
DN18_c0_g1.p15 | 58 | 854-1027 (+) | no
DN18_c0_g1.p16 | 126 | 1-378 (+) | 1252 | 1 | 125 | A0A498HHW2 | Cysteine synthase | 1103 | 965 | 1089 | 125 | 2.90E-58 | 231.5 | Magnoliopsida
DN18_c0_g1.p17 | 109 | 59-385 (+) | no
DN18_c0_g1.p18 | 81 | 1222-980 (-) | no
DN18_c0_g1.p19 | 74 | 1032-1250 (+) | | 27 | 73 | B7FKU7 | Cysteine synthase | 325 | 51 | 97 | 47 | 5.50E-17 | 93.6 | Pentapetalae
DN18_c0_g1.p20 | 64 | 235-426 (+) | | 4 | 52 | A0A2I4HII2 | Cysteine synthase | 81 | 3 | 51 | 49 | 9.20E-16 | 89.4 | Cellular organism
DN18_c0_g1.p21 | 327 | 229-1209 (+) | 1453 | 9 | 326 | A0A498HHW2 | Cysteine synthase | 1103 | 786 | 1103 | 318 | 1.50E-162 | 579.3 | Magnoliopsida
DN18_c0_g1.p22 | 126 | 494-117 (-) | | 1 | 74 | A0A0A9NZJ6 | Uncharacterized | 98 | 25 | 98 | 74 | 2.70E-11 | 75.5 | Arundo donax
DN18_c0_g1.p23 | 109 | 848-1174 (+) | no
DN18_c0_g1.p24 | 75 | 1231-1452 (+) | no

(24)

Annotation/Functional Analysis

Transdecoder predicted coding regions


query | length | source | transcript length | begin | end | subject | title | length | begin | end | align length | E | score | taxonomy
DN53_c0_g1.p1 | 769 | 239-2545 (+) | 3139 | 62 | 756 | A0A443PRB7 | SWIM-type | 783 | 6 | 781 | 778 | 1.40E-283 | 982.6 | Cinnamomum micranthum
DN53_c0_g1.p2 | 80 | 2514-2275 (-) | no
DN53_c0_g1.p3 | 70 | 1648-1439 (-) | no
DN53_c0_g1.p4 | 65 | 2333-2139 (-) | no
DN53_c0_g1.p5 | 64 | 3-194 (+) | no
DN53_c0_g1.p6 | 63 | 2752-2940 (+) | no
DN53_c0_g1.p7 | 54 | 2012-1851 (-) | no
DN53_c0_g1.p8 | 50 | 1785-1934 (+) | no
DN53_c0_g1.p9 | 50 | 2531-2382 (-) | no
DN53_c0_g1.p10 | 769 | 239-2545 (+) | 3254 | 62 | 756 | A0A443PRB7 | SWIM-type | 783 | 6 | 781 | 778 | 1.40E-283 | 982.6 | Cinnamomum micranthum
DN53_c0_g1.p11 | 80 | 2514-2275 (-) | no
DN53_c0_g1.p12 | 70 | 1648-1439 (-) | no
DN53_c0_g1.p13 | 65 | 2333-2139 (-) | no
DN53_c0_g1.p14 | 64 | 3-194 (+) | no
DN53_c0_g1.p15 | 63 | 2867-3055 (+) | no
DN53_c0_g1.p16 | 54 | 2012-1851 (-) | no
DN53_c0_g1.p17 | 50 | 1785-1934 (+) | no
DN53_c0_g1.p18 | 50 | 2531-2382 (-) | no
DN53_c0_g3.p1 | 828 | 205-2688 (+) | 2958 | 58 | 827 | A0A443PRB7 | SWIM-type | 783 | 1 | 783 | 783 | 0.00E+00 | 1353.2 | Cinnamomum micranthum
DN53_c0_g3.p2 | 187 | 1031-471 (-) | no
DN53_c0_g3.p3 | 97 | 461-171 (-) | | 1 | 87 | A0A0A9E9H7 | Uncharacterized | 128 | 24 | 110 | 87 | 5.80E-06 | 57.4 | Arundo donax
DN53_c0_g3.p4 | 81 | 2508-2266 (-) | no
DN53_c0_g3.p5 | 69 | 2681-2887 (+) | no
DN53_c0_g3.p6 | 61 | 764-946 (+) | no
DN53_c0_g3.p7 | 58 | 2801-2628 (-) | no
DN53_c0_g3.p8 | 55 | 2-166 (+) | no
DN53_c0_g3.p9 | 51 | 2309-2157 (-) | no
DN53_c0_g3.p10 | 828 | 205-2688 (+) | 3068 | 58 | 827 | A0A443PRB7 | SWIM-type | 783 | 1 | 783 | 783 | 0.00E+00 | 1353.2 | Cinnamomum micranthum
DN53_c0_g3.p11 | 187 | 1031-471 (-) | no
DN53_c0_g3.p12 | 97 | 461-171 (-) | | 1 | 87 | A0A0A9E9H7 | Uncharacterized | 128 | 24 | 110 | 87 | 5.80E-06 | 57.4 | Arundo donax
DN53_c0_g3.p13 | 81 | 2508-2266 (-) | no
DN53_c0_g3.p14 | 61 | 764-946 (+) | no
DN53_c0_g3.p15 | 55 | 2-166 (+) | no
DN53_c0_g3.p16 | 51 | 2309-2157 (-) | no
DN53_c0_g3.p17 | 50 | 2848-2997 (+) | no

(25)

Annotation/Functional Analysis

Transdecoder predicted coding regions

(26)

Annotation/Functional Analysis

Transdecoder predicted coding regions


query | length | begin | end | subject | length | begin | end | align length | E | score | title | taxonomy
DN8_c0_g1.p5 | 675 | 1 | 674 | A0A1U7Z5Q7 | 679 | 1 | 679 | 700 | 2.00E-129 | 470.3 | filament-like | Magnoliopsida
DN8_c0_g1 | 3405 | 546 | 2567 | A0A1U7Z5Q7 | 679 | 1 | 679 | 700 | 6.80E-130 | 472.6 | filament-like | Magnoliopsida
DN8_c0_g2.p1 | 676 | 1 | 407 | A0A5B6ZQX7 | 449 | 1 | 406 | 411 | 9.10E-130 | 471.5 | filament-like | Pentapetalae
DN8_c0_g2 | 3043 | 719 | 1939 | A0A5B6ZQX7 | 449 | 1 | 406 | 411 | 8.00E-130 | 472.2 | filament-like | Pentapetalae
DN78_c0_g1.p1 | 193 | 42 | 192 | A0A3S3R0T1 | 151 | 1 | 151 | 151 | 1.10E-77 | 296.6 | Lipoyl synthase | Cinnamomum
DN78_c0_g1 | 891 | 125 | 577 | A0A3S3R0T1 | 151 | 1 | 151 | 151 | 1.00E-77 | 297.4 | Lipoyl synthase | Cinnamomum
DN18_c0_g1.p1 | 327 | 9 | 326 | A0A498HHW2 | 1103 | 786 | 1103 | 318 | 1.50E-162 | 579.3 | Cysteine synthase | Magnoliopsida
DN18_c0_g1 | 1354 | 109 | 1107 | A0A498HHW2 | 1103 | 772 | 1103 | 333 | 4.10E-163 | 581.6 | Cysteine synthase | Magnoliopsida

(27)

Annotation/Functional Analysis

TransDecoder.Predict predicted coding regions

• 74 predicted isoforms

• ranked by LL

>TRINITY_DN0_c0_g2_i14.p1 TRINITY_DN0_c0_g2~~TRINITY_DN0_c0_g2_i14.p1 ORF type:complete len:915 (+),score=220.37 TRINITY_DN0_c0_g2_i14:195-2939(+)
>TRINITY_DN0_c0_g2_i29.p1 TRINITY_DN0_c0_g2~~TRINITY_DN0_c0_g2_i29.p1 ORF type:complete len:908 (+),score=219.50 TRINITY_DN0_c0_g2_i29:195-2918(+)
>TRINITY_DN0_c0_g2_i25.p1 TRINITY_DN0_c0_g2~~TRINITY_DN0_c0_g2_i25.p1 ORF type:complete len:958 (+),score=214.17 TRINITY_DN0_c0_g2_i25:148-2874(+)
>TRINITY_DN0_c0_g2_i57.p1 TRINITY_DN0_c0_g2~~TRINITY_DN0_c0_g2_i57.p1 ORF type:complete len:951 (+),score=213.30 TRINITY_DN0_c0_g2_i57:148-2853(+)
>TRINITY_DN0_c0_g2_i16.p1 TRINITY_DN0_c0_g2~~TRINITY_DN0_c0_g2_i16.p1 ORF type:complete len:879 (+),score=205.19 TRINITY_DN0_c0_g2_i16:148-2637(+)
>TRINITY_DN0_c0_g2_i65.p1 TRINITY_DN0_c0_g2~~TRINITY_DN0_c0_g2_i65.p1 ORF type:complete len:872 (+),score=204.32 TRINITY_DN0_c0_g2_i65:148-2616(+)
>TRINITY_DN0_c0_g2_i7.p1 TRINITY_DN0_c0_g2~~TRINITY_DN0_c0_g2_i7.p1 ORF type:complete len:821 (+),score=199.81 TRINITY_DN0_c0_g2_i7:590-3052(+)
>TRINITY_DN0_c0_g2_i41.p1 TRINITY_DN0_c0_g2~~TRINITY_DN0_c0_g2_i41.p1 ORF type:complete len:833 (+),score=196.98 TRINITY_DN0_c0_g2_i41:148-2499(+)
>TRINITY_DN0_c0_g2_i18.p1 TRINITY_DN0_c0_g2~~TRINITY_DN0_c0_g2_i18.p1 ORF type:complete len:826 (+),score=196.11 TRINITY_DN0_c0_g2_i18:148-2478(+)
>TRINITY_DN0_c0_g2_i30.p1 TRINITY_DN0_c0_g2~~TRINITY_DN0_c0_g2_i30.p1 ORF type:complete len:777 (+),score=190.38 TRINITY_DN0_c0_g2_i30:195-2525(+)
>TRINITY_DN0_c0_g2_i48.p1 TRINITY_DN0_c0_g2~~TRINITY_DN0_c0_g2_i48.p1 ORF type:complete len:770 (+),score=189.50 TRINITY_DN0_c0_g2_i48:195-2504(+)
>TRINITY_DN0_c0_g2_i26.p1 TRINITY_DN0_c0_g2~~TRINITY_DN0_c0_g2_i26.p1 ORF type:complete len:652 (+),score=147.66 TRINITY_DN0_c0_g2_i26:148-1956(+)
>TRINITY_DN0_c0_g2_i74.p1 TRINITY_DN0_c0_g2~~TRINITY_DN0_c0_g2_i74.p1 ORF type:complete len:645 (+),score=146.79 TRINITY_DN0_c0_g2_i74:148-1935(+)
>TRINITY_DN0_c0_g2_i68.p1 TRINITY_DN0_c0_g2~~TRINITY_DN0_c0_g2_i68.p1 ORF type:complete len:295 (+),score=72.21 TRINITY_DN0_c0_g2_i68:195-1079(+)
>TRINITY_DN0_c0_g2_i68.p2 TRINITY_DN0_c0_g2~~TRINITY_DN0_c0_g2_i68.p2 ORF type:3partial len:215 (+),score=52.78 TRINITY_DN0_c0_g2_i68:1432-2073(+)
>TRINITY_DN0_c0_g2_i22.p2 TRINITY_DN0_c0_g2~~TRINITY_DN0_c0_g2_i22.p2 ORF type:complete len:116 (+),score=11.57 TRINITY_DN0_c0_g2_i22:3672-4019(+)
>TRINITY_DN0_c0_g2_i7.p5 TRINITY_DN0_c0_g2~~TRINITY_DN0_c0_g2_i7.p5 ORF type:complete len:74 (+),score=2.15 TRINITY_DN0_c0_g2_i7:179-400(+)
>TRINITY_DN0_c0_g2_i18.p4 TRINITY_DN0_c0_g2~~TRINITY_DN0_c0_g2_i18.p4 ORF type:complete len:66 (+),score=1.99 TRINITY_DN0_c0_g2_i18:3933-4130(+)

ORF score | query | length | begin | end | subject | length | begin | end | align length | E | score | title
220.37 | TRINITY_DN0_c0_g2_i14.p1 | 915 | 1 | 913 | A2Q5Q7 | 901 | 1 | 901 | 940 | 1.30E-285 | 989.6 | Not CCR4-Not complex component, N-terminal; tRNA-binding arm
219.50 | TRINITY_DN0_c0_g2_i29.p1 | 908 | 1 | 906 | A2Q5Q7 | 901 | 1 | 901 | 940 | 1.10E-279 | 969.9 | Not CCR4-Not complex component, N-terminal; tRNA-binding arm
214.17 | TRINITY_DN0_c0_g2_i25.p1 | 909 | 1 | 907 | A2Q5Q7 | 901 | 1 | 901 | 927 | 4.80E-251 | 874.8 | Not CCR4-Not complex component, N-terminal; tRNA-binding arm
213.30 | TRINITY_DN0_c0_g2_i57.p1 | 902 | 1 | 900 | A2Q5Q7 | 901 | 1 | 901 | 927 | 3.90E-245 | 855.1 | Not CCR4-Not complex component, N-terminal; tRNA-binding arm
205.19 | TRINITY_DN0_c0_g2_i16.p1 | 830 | 1 | 826 | A2Q5Q7 | 901 | 1 | 820 | 846 | 5.80E-203 | 714.9 | Not CCR4-Not complex component, N-terminal; tRNA-binding arm
204.32 | TRINITY_DN0_c0_g2_i65.p1 | 823 | 1 | 819 | A2Q5Q7 | 901 | 1 | 820 | 846 | 3.60E-197 | 695.7 | Not CCR4-Not complex component, N-terminal; tRNA-binding arm
199.81 | TRINITY_DN0_c0_g2_i7.p1 | 821 | 1 | 819 | A2Q5Q7 | 901 | 95 | 901 | 846 | 3.70E-234 | 818.5 | Not CCR4-Not complex component, N-terminal; tRNA-binding arm
196.98 | TRINITY_DN0_c0_g2_i41.p1 | 784 | 1 | 781 | A0A3B6I0Y8 | 763 | 1 | 736 | 800 | 3.90E-185 | 655.6 | Not3 domain-containing protein
196.11 | TRINITY_DN0_c0_g2_i18.p1 | 777 | 1 | 774 | A0A3B6I0Y8 | 763 | 1 | 736 | 800 | 2.40E-179 | 636.3 | Not3 domain-containing protein
190.38 | TRINITY_DN0_c0_g2_i30.p1 | 777 | 1 | 775 | A2Q5Q7 | 901 | 1 | 901 | 940 | 1.50E-213 | 750 | Not CCR4-Not complex component, N-terminal; tRNA-binding arm
189.50 | TRINITY_DN0_c0_g2_i48.p1 | 770 | 1 | 768 | A2Q5Q7 | 901 | 1 | 901 | 940 | 1.20E-207 | 730.3 | Not CCR4-Not complex component, N-terminal; tRNA-binding arm
147.66 | TRINITY_DN0_c0_g2_i26.p1 | 603 | 1 | 590 | A0A0D2U011 | 623 | 1 | 558 | 597 | 9.70E-184 | 650.6 | Not3 domain-containing protein

(28)

Annotation/Functional Analysis

TransDecoder predicted coding regions


Query 1 MGASRKLQGEIDRVLKKVQEGVDVFDSIWNKVYDTDNANQKEKFEADLKKEIKKLQRYRD 60 MGASRKLQGEIDRVLKKVQEGV+VFDSIWNKVYDTDNANQKEKFEADLKKEIKKLQRYRD Sbjct 1 MGASRKLQGEIDRVLKKVQEGVEVFDSIWNKVYDTDNANQKEKFEADLKKEIKKLQRYRD 60 Query 61 QIKTWIQSSEIKDKKVSASYEQALLESRKQIEREMERFKVCEKETKTKAFSKEGLVQQPK 120 QIKTWIQSSEIKDKKVSASYEQAL+++RK IEREMERFK+CEKETKTKAFSKEGL QQPK Sbjct 61 QIKTWIQSSEIKDKKVSASYEQALVDARKLIEREMERFKICEKETKTKAFSKEGLGQQPK 120 Query 121 TDPKEKAKSETRDWLNNVVGELESQIDNFEAELEGLFVKKGKTRPPRLTHLETSIVRHKA 180 TDP+EKAKSETRDWLNNVVGELESQIDNFEAELEGL VKKGK RP RLTHLETSI RHKA

Sbjct 121 TDPREKAKSETRDWLNNVVGELESQIDNFEAELEGLTVKKGKNRPSRLTHLETSITRHKA 180

Query 181 HIMKLELILRLLDNDELSPDQVNDVKDFLDDYVERNQEQFDEFSDVDELYSSLPLDKVES 240 HI K EL+LRLLDNDELSP++VNDVKDFLDDYVERNQ+ FDEF DVDELYSSLPLDKV++

Sbjct 181 HIKKCELVLRLLDNDELSPEEVNDVKDFLDDYVERNQDDFDEFDDVDELYSSLPLDKVDT 240 Query 241 LEDLVAIGTPALVVKGVS--PISTGSAV---LSLKTSVATSPTHSSA 282 LEDLV I T V K +S P+ G + LSLKT +A S + S++ Sbjct 241 LEDLVTIPTSVAVAKTISSLPLDEGKTLEDLVTIPTGLAKVAPGLSLKTPLAASASQSAS 300 Query 283 TLPSTAQQVSSVQDQAEETASQDSNSDSAPRTPPSKSGMMGSSVSSVSSAVGSIPTGSNT 342 S +QA+ETASQDSNSD +TPP KSG + SS S+ PTG++ Sbjct 301 ---SQTSEQADETASQDSNSDIVAKTPPPKSGGISSSTST---PTGNH- 342 Query 343 TVATPAR-NLAG----GSTASAILSGPGYIRGVMENAPAAVSSSLANLSSSVQEDDVSSF 397

ATPA N++G + A+AILG +R ++ENA + N S+S +E+++++F

Sbjct 343 --ATPASVNVSGLNLSSAPAAAILPGSNSVRNILENA---IVNQSTSPKEEEINNF 393 Query 398 PGRRSSPALPEIGIGKGIGRGSVVAGLSSPVSGVSLNLTSGNGLPSNGALGTTPVVSDMA 457 P RR SP+L + + + GR S+ S + S+ L SGN + S GALG P S++ Sbjct 394 PTRRPSPSLSDAALVR--GRNSL---SNQATASIPLGSGNTVSSIGALGVVPSASEIT 446 Query 458 KRNLLGADERIGNG--AQPLVSPLSNRMLLQQVSKTMDGIVSSDSNNIGE-GVTAGRTFS 514 KRN+LGAD+R+G+ QPLVSPLSNR++L Q+ K DG S DS+ + E +GR FS Sbjct 447 KRNILGADDRLGSSGMVQPLVSPLSNRLILPQIGKANDGAASVDSSIVNEAAAVSGRVFS 506 Query 515 PSAVSGVQWRPQSPSSFQNQNEMGQFRGRTEIAPDQREKFLQRLQQVQQQGHSNLLGVSH 574 PS V G+QWRP SP FQNQN+ GQ RGRTEIAPDQREKFLQ+ QQVQQQG S LL + Sbjct 507 PSVVPGMQWRPGSP--FQNQNDAGQLRGRTEIAPDQREKFLQKFQQVQQQGPSTLLNMPS 564 Query 575 LPGANHKQFPTQ---QQFNSQSSSLSPQVGLGLGVQSSVGLTAVTSSSLQQQSAIHQ 628 L G NHKQF +Q QQFNSQ SS+S Q +GLG QS L ++S SLQQ +++H Sbjct 565 LVGGNHKQFSSQQQSPLLQQFNSQGSSVSSQSSMGLGAQSP-SLGGISSVSLQQLNSVHS 623 Query 629 QSAQHALMPAGPRDTDAAQVKIEDQQQQHNSSDDVNTELATNPELNKILMNEDDLKTSYM 688 S QH +D D K E+ QQ N D+ TE ++ + K L EDDLK++Y Sbjct 624 PSGQHPFAGVA-KDAD----KFEEHQQHQNFPDESTTESTSSTGIGKNLTVEDDLKSAYA 678 Query 689 ----AGGTGSSKDATQVPRDTDLSPRQPLPFNQSSADLGVIGRRSVPDLGAIGDNLSQST 744 AG + S +A Q RD DLSP QPL NQS+ +LGVIGRR+ +LGAIGD+ S+ Sbjct 679 LDSPAGLSASLPEAAQTFRDIDLSPGQPLQSNQSTGNLGVIGRRNGVELGAIGDSFGASS 738 Query 745 VNNGLMQERLYSLQMLDAAYHRLPQSKDSERAKNYTPRHPTKTPASFPQVQAPIVDNPAF 804 VN+G ++++LY+LQML+AA+ R+PQ +DSER + YTPRHP TP+S+PQVQAPIV+NPAF

Sbjct 739 VNSGGVRDQLYNLQMLEAAHFRMPQPRDSERPRTYTPRHPAITPSSYPQVQAPIVNNPAF 798

Query 805 WERLSLDSVGTDTLFFAFYYQQNTYQQYLAARELKKQSWRYHRKYSTWFQRHEEPKVTTD 864 WERL L+ GTDTLFFAFYYQQNTYQQYLAA+ELKKQSWRYHRKY+TWFQRHEEPKV TD

Sbjct 799 WERLGLEPFGTDTLFFAFYYQQNTYQQYLAAKELKKQSWRYHRKYNTWFQRHEEPKVATD 858

Query 865 EYEQGTYVYFDFHIANDDLNHGWCQRIKTEFTFEYSYLEDELL 907 +YEQGTYVYFDFHIANDDL HGWCQRIK +FTFEY+YLEDEL+

Sbjct 859 DYEQGTYVYFDFHIANDDLQHGWCQRIKNDFTFEYNYLEDELV 901 Query 1 MGASRKLQGEIDRVLKKVQEGVDVFDSIWNKVYDTENANQKEKFEADLKKEIKKLQRYRD 60 MGASRKLQGEIDRVLKKVQEGV+VFDSIWNKVYDT+NANQKEKFEADLKKEIKKLQRYRD Sbjct 1 MGASRKLQGEIDRVLKKVQEGVEVFDSIWNKVYDTDNANQKEKFEADLKKEIKKLQRYRD 60 Query 61 QIKTWIQSSEIKDKKVSASYEQALLDARKIIEREMERFKVCEKETKTKAFSKEGLGQQPK 120 QIKTWIQSSEIKDKKVSASYEQAL+DARK+IEREMERFK+CEKETKTKAFSKEGLGQQPK Sbjct 61 QIKTWIQSSEIKDKKVSASYEQALVDARKLIEREMERFKICEKETKTKAFSKEGLGQQPK 120 Query 121 TDPKEKAKSETRDWLNNVVSELESQVDNFEAEIEGLSFKKGKTRPPRLTHLETSIVRHKA 180 TDP+EKAKSETRDWLNNVV ELESQ+DNFEAE+EGL+ KKGK RP RLTHLETSI RHKA

Sbjct 121 TDPREKAKSETRDWLNNVVGELESQIDNFEAELEGLTVKKGKNRPSRLTHLETSITRHKA 180

Query 181 HIMKLELILRLLDNDELSPDQVNDVRDFLEDYVERNQEQFDEFSDVDELYNTLPLDKVES 240
HI K EL+LRLLDNDELSP++VNDV+DFL+DYVERNQ+ FDEF DVDELY++LPLDKV++
Sbjct 181 HIKKCELVLRLLDNDELSPEEVNDVKDFLDDYVERNQDDFDEFDDVDELYSSLPLDKVDT 240

Query 241 LEDLVAIGPP-ALVKGVTSVP---AAGAVLGLKTSLATSATQLPATSP--STAQQGAS 292
LEDLV I A+ K ++S+P ++ + T LA A L +P ++A Q AS
Sbjct 241 LEDLVTIPTSVAVAKTISSLPLDEGKTLEDLVTIPTGLAKVAPGLSLKTPLAASASQSAS 300

Query 293 IQ--DQAEETASQDSNSDVILRTPPSKNGVMGSSVSSSTTAIGSATPAGSNIATAAGNIS 350
Q +QA+ETASQDSNSD++ +TPP K+G +SSST +TP G++ A+ N+S
Sbjct 301 SQTSEQADETASQDSNSDIVAKTPPPKSG----GISSST---STPTGNHATPASVNVS 351

Query 351 AHSLVGGPTASAIL--SSPVRGTMDNTTAAASQPPVNLPSSIKEDENATVPNRRPSPALA 408
+L P A+AIL S+ VR ++N VN +S KE+E P RRPSP+L+
Sbjct 352 GLNLSSAP-AAAILPGSNSVRNILENAI---VNQSTSPKEEEINNFPTRRPSPSLS 403

Query 409 DVGLAKAIGRGSAVGGMSSQ-LSGISLSSGNGIPSDAALGGGPTVSDIAKHNILGADERI 467
D L + GR S +S+Q + I L SGN + S ALG P+ S+I K NILGAD+R+
Sbjct 404 DAALVR--GRNS----LSNQATASIPLGSGNTVSSIGALGVVPSASEITKRNILGADDRL 457

Query 468 G-NGSLQPLVSPLSNRMLLQPASRASDGTVSTESSNVGDSTVIGGRVFSPS-VPGVQWKP 525
G +G +QPLVSPLSNR++L +A+DG S +SS V ++ + GRVFSPS VPG+QW+P
Sbjct 458 GSSGMVQPLVSPLSNRLILPQIGKANDGAASVDSSIVNEAAAVSGRVFSPSVVPGMQWRP 517

Query 526 HNTGSFPNTNEMGQFRGRTEIAPDQREKFLQRLQQV-QQGHSTLLGVPHLAGANHKQFAT 584
+ F N N+ GQ RGRTEIAPDQREKFLQ+ QQV QQG STLL +P L G NHKQF++
Sbjct 518 GS--PFQNQNDAGQLRGRTEIAPDQREKFLQKFQQVQQQGPSTLLNMPSLVGGNHKQFSS 575

Query 585 QPQSSLLQQFNSQSSPVSPQVGLGPGVQ--SLAGATATSSSLQITMHQQSGQHALLSVGP 642
Q QS LLQQFNSQ S VS Q +G G Q SL G ++ S ++H SGQH V
Sbjct 576 QQQSPLLQQFNSQGSSVSSQSSMGLGAQSPSLGGISSVSLQQLNSVHSPSGQHPFAGVA- 634

Query 643 KDTDAAHVKVEDQQQHQNPSDDLKTEPATNSGLSKNLMNEDDLKFSYAADTPSGGSGPLT 702
KD D K E+ QQHQN D+ TE +++G+ KNL EDDLK +YA D+P+G S L
Sbjct 635 KDAD----KFEEHQQHQNFPDESTTESTSSTGIGKNLTVEDDLKSAYALDSPAGLSASLP 690

Query 703 EAVHEPRDVDLSPRQPLQSNQSSAGLGVIGRRSVSDLGAIGDNLSASTANSGAIQEQLYN 762
EA RD+DLSP QPLQSNQS+ LGVIGRR+ +LGAIGD+ AS+ NSG +++QLYN
Sbjct 691 EAAQTFRDIDLSPGQPLQSNQSTGNLGVIGRRNGVELGAIGDSFGASSVNSGGVRDQLYN 750

Query 763 LQMLEAAFCKLPQPKDSERTKHYIPRHPVKTPPSFPQVPAPVVDNPAFWERLSLEPLGTD 822
LQMLEAA ++PQP+DSER + Y PRHP TP S+PQV AP+V+NPAFWERL LEP GTD
Sbjct 751 LQMLEAAHFRMPQPRDSERPRTYTPRHPAITPSSYPQVQAPIVNNPAFWERLGLEPFGTD 810

Query 823 TLFFAFYYQPNTYQQYLAARELKKQSWRYHRKYSTWFQRHEEPKVTTDEYEQGTYVYFDF 882
TLFFAFYYQ NTYQQYLAA+ELKKQSWRYHRKY+TWFQRHEEPKV TD+YEQGTYVYFDF
Sbjct 811 TLFFAFYYQQNTYQQYLAAKELKKQSWRYHRKYNTWFQRHEEPKVATDDYEQGTYVYFDF 870

Query 883 HVANDDSQNGWCQRIKTEFTFEYLYLEDELV 913
H+ANDD Q+GWCQRIK +FTFEY YLEDELV
Sbjct 871 HIANDDLQHGWCQRIKNDFTFEYNYLEDELV 901

(29)

Transcriptome Assembly

Trinity downstream

• Expression quantification

• salmon (kallisto)

• RSEM

• eXpress

• Super transcripts – use transcript maps to produce gene pseudo sequences

• Coding region determination/best transcript

• longest transcript is not necessarily the best

• transdecoder.pl

• extract the long open reading frames

• Optionally, identify ORFs with homology to known proteins via blast or pfam searches.

• predict the likely coding regions

• report predicted proteins over minimum length
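The "extract the long open reading frames" step can be illustrated with a minimal Python sketch. This is not TransDecoder's implementation: the function names and the `min_len` parameter are hypothetical, and the sketch only looks for the longest ATG-to-stop ORF on either strand, with none of the scoring that a real coding-region predictor applies.

```python
# Illustrative sketch only (not TransDecoder): longest ATG..stop ORF on either strand.
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def orfs_in_strand(seq):
    """Yield (start, end) for every ATG..stop ORF in the three frames of seq."""
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # first start codon opens the ORF
            elif start is not None and codon in STOPS:
                yield start, i + 3             # include the stop codon
                start = None


def longest_orf(seq, min_len=0):
    """Return the longest ORF (as a string) of at least min_len bases, or ''."""
    seq = seq.upper()
    best = ""
    for strand in (seq, reverse_complement(seq)):
        for s, e in orfs_in_strand(strand):
            if e - s >= min_len and e - s > len(best):
                best = strand[s:e]
    return best


print(longest_orf("CCATGAAATTTGGGTAACC"))  # ATGAAATTTGGGTAA
```

The toy sequence is made up; on real transcripts a minimum length threshold would be set through `min_len`.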

(30)

Transcriptome Assembly

Trinity counting reads with Salmon

• Similar to read mapping, but doesn't actually align the reads

• uses k-mers to match reads to transcripts

• first index the reference; here the reference is the transcriptome assembly

Trinity.fasta (renamed to avocado_trinity_200422.fa)

• using the cleaned decontaminated reads, compare to index to get counts

• output is a series of directories with the sample names

• counts are called quant.sf
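A minimal Python sketch of how the per-transcript counts in a quant.sf file might be summed to gene level. The example rows are made up, the function name is hypothetical, and the gene ID is assumed to be the Trinity transcript ID with its trailing `_iN` isoform suffix removed; the column names (Name, Length, EffectiveLength, TPM, NumReads) are the standard quant.sf header.

```python
import csv
import io
import re
from collections import defaultdict


def gene_counts(quant_sf_text):
    """Sum NumReads over all isoforms of each Trinity gene."""
    totals = defaultdict(float)
    reader = csv.DictReader(io.StringIO(quant_sf_text), delimiter="\t")
    for row in reader:
        # assumption: gene ID = transcript ID without the _i<number> suffix
        gene = re.sub(r"_i\d+$", "", row["Name"])
        totals[gene] += float(row["NumReads"])
    return dict(totals)


# made-up rows in quant.sf layout
example = (
    "Name\tLength\tEffectiveLength\tTPM\tNumReads\n"
    "TRINITY_DN7498_c0_g1_i1\t1200\t1010.5\t12.0\t100.0\n"
    "TRINITY_DN7498_c0_g1_i2\t900\t710.5\t8.0\t50.0\n"
    "TRINITY_DN10_c0_g2_i1\t400\t210.5\t3.0\t7.5\n"
)
print(gene_counts(example))
```

In practice the text would be read from each sample's quant.sf file, and the Trinity.fasta.gene_trans_map file could be used instead of the suffix rule.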

(31)

Transcriptome Assembly

Trinity counting reads with Salmon

• my job file

# Run salmon to count reads vs de novo transcriptome
# for UNAL computational genomics course, spring 2020
# Michael Gribskov

# build the salmon index from the Trinity assembly
indexcommand="salmon index -t ../trinity_out_dir/avocado_trinity_200422.fa -i avocado_trinity_200422.index"
echo $indexcommand
$indexcommand
date

# define the directory where the sample data files reside
data="../../contam/ribosomal"

# process all samples -- in this case all files with the suffix .ribosomal.unconc.1.fastq.gz
# make a list of samples based on the R1 files, R2 file names are created automatically
for r1 in $data/*.ribosomal.unconc.1.fastq.gz; do
    # generate r2 name from r1
    r2="${r1/\.1\./.2.}"
    echo -e "R1:$r1\nR2:$r2"

    # generate output file name from r1 by removing directory and suffix
    # should generate a name like sample.salmon
    out=${r1##*/}
    out=${out%.rib*}
    out="$out.salmon"

    command="salmon quant --index avocado_trinity_200422.index --validateMappings \
        -l ISF \
        -1 $r1 \
        -2 $r2 \
        -o $out"
    echo "command: $command"
    $command
done

(32)

Transcriptome Assembly

Counting reads with Salmon

• quant.sf file


(33)

Annotation/Functional Analysis

Quality

• Fraction of reads that map to transcriptome

• Completeness of core transcriptome – BUSCO

• Need the set of reads that were selected during filtering, not the entire

genome/transcriptome

(34)

Annotation/Functional Analysis

Identifying genes

• Comparison to known sequences (extrinsic comparison)

• RNA, same or other species (Blast)

• proteins in other species (Blast)

• Match to orthologous families (e.g., Eggnog)

• Match to functional motifs (e.g., Interpro)

• DNA, coding regions are more conserved than non-coding (LAGAN, tblastn)

• Ab initio prediction (intrinsic comparison)

• intrinsic differences between coding and non-coding

• known sites/signals such as splice junctions and terminators

(35)

Annotation/Functional Analysis

• How do I apply information that I already know about a gene, a protein, or a virus to a newly discovered one?

• Find a homolog with known biochemistry

• assume that, because of the ancestral relationship, the novel gene shares many functional properties

(36)

Annotation/Functional Analysis

Metaphor

• How do you find the strands of gold in a haystack?

• How good do you have to be?

• average size protein ~300 residues in $3 \times 10^{10}$ residues (2017)

• $300 / (3 \times 10^{10})$ residues ~ 1 in $10^{8}$ ($10^{-8}$)

• average size gene ~10 kbp in $2 \times 10^{12}$ bases (2017)

• $10^{4} / (2 \times 10^{12})$ nucleotides ~ 1 in $2 \times 10^{8}$ ($5 \times 10^{-9}$)

• Database searching succeeds by efficiently reducing the

number of unrelated sequences.

(37)

Annotation/Functional Analysis

Sequence Database Searching

• Related sequences (homologs) have runs of similar bases or residues with few gaps

• insertion/deletion mutations are much less common than missense mutations

• constraints on structure and function keep sequences corresponding to proteins, and especially the interior of proteins, from changing

• Unrelated sequences have some random level of similarity

• truly random matches

• biased composition

• generic patterns of sequences

• codon choice

(38)

Annotation/Functional Analysis

(39)

Annotation/Functional Analysis

BLAST procedure

• Step 1: Compile list of high scoring words from query sequence

• Step 2: Scan database for "hits"

• Step 3: Extend regions with 2 hits into MSPs

(40)

Annotation/Functional Analysis

BLAST

• Maximal Segment Pairs (MSP)

• Highest scoring pair of identical length segments from two sequences

• Local alignment without gaps

• Expected distribution is known!

• In BLAST2, a “diagonal” must have two word hits before extension to MSP is attempted.

• In principle, must examine diagonal until score drops to zero

• Shortcut, only check until score drops by X

Potential MSP

match = +1

mismatch = -1

T G C A A T C G A T C G T C G T C C G T A T A C A
  : :         : : : :   :   : :     :
A G C T C G T G A T C C T G G T G G G A T C G G T
0 1 2 1 0 0 0 1 2 3 4 3 4 3 4 5 4 3 4 3 2 1 0 0 0   running sum

initial identically matching word
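The running sum in the figure, and the "only check until score drops by X" shortcut, can be sketched in Python. This is an illustration rather than BLAST's actual code: the function names are hypothetical, the seed is assumed to be the four identical bases GATC at 0-based position 7, and the drop-off value `x_drop=2` is arbitrary. The running sum is reset to zero whenever it would go negative, which is what reproduces the numbers in the figure.

```python
def running_sum(a, b, match=1, mismatch=-1):
    """Running score along one diagonal, reset to 0 when it would go negative."""
    total, sums = 0, []
    for x, y in zip(a, b):
        total = max(0, total + (match if x == y else mismatch))
        sums.append(total)
    return sums


def extend_seed(a, b, seed_start, seed_len, x_drop, match=1, mismatch=-1):
    """Ungapped extension of a seed in both directions with an X-drop cutoff.

    Returns (start, end, score) of the best segment found, end inclusive.
    """
    def pair(i):
        return match if a[i] == b[i] else mismatch

    best = sum(pair(i) for i in range(seed_start, seed_start + seed_len))

    # extend to the right until the score falls more than x_drop below the best
    best_end, running = seed_start + seed_len - 1, best
    for i in range(best_end + 1, min(len(a), len(b))):
        running += pair(i)
        if running > best:
            best, best_end = running, i
        elif best - running > x_drop:
            break

    # extend to the left in the same way
    best_start, running = seed_start, best
    for i in range(seed_start - 1, -1, -1):
        running += pair(i)
        if running > best:
            best, best_start = running, i
        elif best - running > x_drop:
            break

    return best_start, best_end, best


top = "TGCAATCGATCGTCGTCCGTATACA"
bottom = "AGCTCGTGATCCTGGTGGGATCGGT"
print(running_sum(top, bottom))
# [0, 1, 2, 1, 0, 0, 0, 1, 2, 3, 4, 3, 4, 3, 4, 5, 4, 3, 4, 3, 2, 1, 0, 0, 0]
print(extend_seed(top, bottom, seed_start=7, seed_len=4, x_drop=2))
# (7, 15, 5)
```

With these settings the extension stops once the score has fallen 3 below the best value seen, and the reported segment (positions 7 to 15, score 5) corresponds to the peak of the running sum.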

(41)

Annotation/Functional Analysis

Statistics

• Sequence matching is not normal, it is extreme!

• Scores follow an extreme value (EVD) or Gumbel distribution

• Z score can't be directly converted to probability

• Whenever you are looking at a distribution of maxima

• longest run of heads in coin toss

• maximum scores for each sequence in database

• Sequence matches are a lot like coin tosses!
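The coin-toss analogy can be tried directly with a small simulation. This is only a sketch: the function names, the number of tosses, the number of trials, and the random seed are arbitrary choices. It records the longest run of heads in each trial; a histogram of these maxima is skewed with a long right tail, unlike a normal distribution.

```python
import random
from collections import Counter


def longest_run(tosses):
    """Length of the longest run of heads (truthy values) in a sequence of tosses."""
    best = current = 0
    for t in tosses:
        current = current + 1 if t else 0
        best = max(best, current)
    return best


def simulate(n_tosses, n_trials, seed=0):
    """Longest run of heads in each of n_trials series of fair coin tosses."""
    rng = random.Random(seed)
    return [longest_run(rng.random() < 0.5 for _ in range(n_tosses))
            for _ in range(n_trials)]


runs = simulate(n_tosses=1024, n_trials=500)
mean_run = sum(runs) / len(runs)
print("mean longest run:", mean_run)
for length, count in sorted(Counter(runs).items()):
    print(f"{length:2d} {'#' * (count // 5)}")   # crude text histogram
```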

PTVQGLRLFE

::  :  :

++--+--+--

(42)

Annotation/Functional Analysis

Sequence Database Searching

[Figure: Extreme Value Distribution]

(43)

Annotation/Functional Analysis

BLAST is based on Significant MSPs

• Scoring system

• Must have at least one positive score

• Expected score must be less than zero

• E = f

_i

s

_i

• Probability of an MSP scoring higher than S

• P(MSP>S)  KNe

-S

N = size of data, K and  are constants
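A short Python sketch of the two conditions on this slide, using a toy DNA scoring system (match +1, mismatch -1, uniform base frequencies). The function names are hypothetical and the value of K is made up for illustration; lam = ln 3 is the value that belongs to this particular toy scoring system. The tail formula is an approximation that is only meaningful when the result is well below 1.

```python
import math


def expected_score(freqs, scores):
    """Expected score of one aligned pair of independent random residues."""
    return sum(freqs[a] * freqs[b] * scores[a, b] for a in freqs for b in freqs)


def msp_tail_probability(S, N, K, lam):
    """Approximate P(MSP > S) as K * N * exp(-lam * S)."""
    return K * N * math.exp(-lam * S)


bases = "ACGT"
freqs = {b: 0.25 for b in bases}
scores = {(a, b): (1 if a == b else -1) for a in bases for b in bases}

print(expected_score(freqs, scores))      # -0.5: negative, as required

K, lam, N = 0.1, math.log(3), 1_000_000   # K is an illustrative, made-up value
for S in (15, 20, 25):
    print(S, msp_tail_probability(S, N, K, lam))
```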

(44)

Annotation/Functional Analysis

Sequence database searching

(45)

Annotation/Functional Analysis

Dynamic programming alignments

• Alignment - Provides a one-to-one picture of the residues or bases in the sequences

that correspond

1 VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF.... 46

.||.::. | ..||||:|. .:.|.|.| |:| : |.| . |..|

1 GLSDGEWQLVLNVWGKVEADIPGHGQEVLIRLFKGHPETLEKFDKFKHLK 50

. . . . .

47 ..DLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVD 94

| .:|.::| || .| .||.. : . :. ...:.:|.: || | ::.

51 SEDEMKASEDLKKHGATVLTALGGILKKKGHHEAEIKPLAQSHATKHKIP 100

(46)

Annotation/Functional Analysis

Empirical Scoring Systems - Log-odds matrices

• A log-odds scoring system evaluates the relative probabilities of a match representing

true homology versus the chance that a match occurs at random, i.e. the relative

probability of two models

$s_{ij} = \ln\left( \frac{q_{ij}}{p_i p_j} \right)$

• Normally, one multiplies probabilities - since these are log probabilities you get the total log probability by adding them up

• When added up over a matching segment, you get the log of the probability that the segment represents homology relative to the probability that it represents a random match, i.e. how much more likely than chance it is that the matching segment represents homology
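A toy log-odds matrix makes the idea concrete. In this Python sketch the alphabet is DNA, the background frequencies are uniform, and the target frequencies q are invented so that identical bases pair three times more often than chance; none of these numbers come from a real substitution matrix, and the names are hypothetical.

```python
import math

bases = "ACGT"
p = {b: 0.25 for b in bases}                      # background frequencies
# invented target frequencies: 0.75 spread over 4 identities, 0.25 over 12 mismatches
q = {(a, b): (0.1875 if a == b else 0.25 / 12) for a in bases for b in bases}

# log-odds score for each aligned pair
s = {(a, b): math.log(q[a, b] / (p[a] * p[b])) for a in bases for b in bases}


def segment_log_odds(x, y):
    """Sum of log-odds scores over an ungapped aligned segment."""
    return sum(s[a, b] for a, b in zip(x, y))


print(s["A", "A"], s["A", "C"])                    # ln 3 and -ln 3
print(math.exp(segment_log_odds("GATC", "GATC")))  # odds ratio, about 81
```

Four identical positions give an odds ratio of about 3^4 = 81 in favor of homology over chance, obtained by adding four scores of ln 3.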

(47)

Annotation/Functional Analysis

Target frequencies

• Karlin and Altschul showed that for MSPs (Maximal Segment Pairs), amino acids $a_i$ and $a_j$ will be aligned with frequency approaching

$q_{ij} = p_i p_j e^{\lambda s_{ij}}$

where $p_i$ and $p_j$ are the expected probabilities of observing the amino acid residues, $s_{ij}$ is the match score, and $\lambda$ is a scaling constant

• A given scoring matrix will try to align the residues according to the above equation, so $q_{ij}$ are a characteristic set of target frequencies for the scoring matrix S
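For a given score matrix and background frequencies, the scale constant lambda is the positive value that makes the target frequencies sum to one. The Python sketch below finds it by bisection for a toy DNA scoring system (match +1, mismatch -1, uniform frequencies). The function names are hypothetical, and the bracketing assumes the expected score is negative and at least one score is positive.

```python
import math


def solve_lambda(p, scores, tol=1e-12):
    """Positive root of sum_ij p_i p_j exp(lam * s_ij) = 1, found by bisection."""
    def f(lam):
        return sum(p[a] * p[b] * math.exp(lam * scores[a, b])
                   for a in p for b in p) - 1.0

    lo, hi = 1e-6, 1.0          # f(lo) < 0 when the expected score is negative
    while f(hi) <= 0:           # grow the bracket until f changes sign
        hi *= 2
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if f(mid) > 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2


def target_frequencies(p, scores, lam):
    """q_ij = p_i p_j exp(lam * s_ij) for every residue pair."""
    return {(a, b): p[a] * p[b] * math.exp(lam * scores[a, b])
            for a in p for b in p}


bases = "ACGT"
p = {b: 0.25 for b in bases}
scores = {(a, b): (1 if a == b else -1) for a in bases for b in bases}

lam = solve_lambda(p, scores)
q = target_frequencies(p, scores, lam)
print(lam, math.log(3))          # both about 1.0986
print(sum(q.values()))           # about 1.0
```

For this toy system the equation reduces to 0.25 e^lam + 0.75 e^-lam = 1, whose positive solution is ln 3, so the numerical answer can be checked by hand.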

(48)

Annotation/Functional Analysis

Scoring Systems - Information

• How much information do you need to find something interesting?

• An MSP of about 16 bits is required for significance in a pairwise comparison of two 250 long sequences

$\log_2(250^2) = 15.93$ bits

• For a 300 residue protein sequence and the NCBI nr database,

$\log_2(4{,}100{,}000{,}000 \times 300) = 40.2$ bits

• For a 1000 base long DNA sequence and the NCBI nr database,

$\log_2(33{,}300{,}000{,}000 \times 1000) = 44.9$ bits
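The arithmetic on this slide is simply the base-2 logarithm of the size of the search space (query length times database length). A minimal Python sketch, with a hypothetical function name and the sizes taken from the slide:

```python
import math


def bits_needed(query_len, db_len):
    """Bits of information needed to stand out in a query x database search space."""
    return math.log2(query_len * db_len)


print(bits_needed(250, 250))               # two 250-long sequences
print(bits_needed(300, 4_100_000_000))     # 300-residue protein vs protein database
print(bits_needed(1000, 33_300_000_000))   # 1000-base DNA vs nucleotide database
```

Each tenfold increase in database size adds about 3.3 bits to the score required for the same level of significance.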

(49)

Annotation/Functional Analysis

Database Searching

• Big problem is database size

• Bigger database means longer search. Alignments are $O(n^2)$

• Bigger database means worse signal to noise ratio

(50)

Annotation/Functional Analysis

Known genes and proteins

• BLAST (nucleotide) may not show much

• different GC content

• multiple codons / amino acid

• amino acids are selected not

(51)

Annotation/Functional Analysis

Known Genes/Proteins

• BLASTX

• DNA query (translated)

• protein database

• Finds matches to known proteins and

gene models

• May miss alternative exons

KDEL receptor A

(52)

Annotation/Functional Analysis

BLAST

• Output has three main parts

• diagram of matches

• list of top scores

• alignments

• check

• database

(53)

Annotation/Functional Analysis

(54)

Annotation/Functional Analysis

Database Search – BLAST

(55)

Annotation/Functional Analysis

BLAST

• How good an E value do you need?

• E = number of sequences with score >= observed in random database

• E = 1/number of searches to find a score >= observed

• E = size of db * P(Score>=observed)

• $E < 10^{-100}$: very good match

• $E < 10^{-50}$: strong match, probably part of family

• $E < 10^{-20}$: distantly related

• $E > 10^{-5}$: hmmm

(56)

Annotation/Functional Analysis

Database Search – BLAST

(57)

Annotation/Functional Analysis

Database Search – BLAST

• E=e-77

• same sub-family

(58)

Annotation/Functional Analysis

Database Search – BLAST

• E=e-25

(59)

Annotation/Functional Analysis

Database search – BLAST

• E=e-09

• same superfamily

(60)

Annotation/Functional Analysis

Database search – BLAST

• E=9

• In this case clearly a distant homolog, why is the score so low?

• Possibly

(61)

Genomics - Gene Annotation

Sequence matching - What can go wrong?

• Poor gene model leads to unconvincing match

• Truncation

• Missing exons

• Included introns

• Fused genes

• Pseudogenes

• Relationship to other genes is not clear from score alone

• Look at trees/clustering

• No sufficiently similar sequences are found

• Use motif methods

• Regular expression

• PSSM/HMM

• Structure matching (threading)

(62)

Annotation/Functional Analysis

Database Searching – Orthologs

• Orthologs are generally assumed to be functionally identical (thus all the fuss)

• Usually defined for genomic analysis as mutual best hits in BLAST search

• a is best hit for b, b is best hit for a

• If the automatic gene prediction misses genes this is not reliable

• Stochastic gain (by duplication) and loss (by deletion) of genes makes identifying

orthology problematic
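The mutual-best-hit definition can be sketched in a few lines of Python. The hit lists below are made up, the function names are hypothetical, and each hit is assumed to be a (query, subject, bit score) tuple such as could be extracted from tabular BLAST output.

```python
def best_hits(hits):
    """Map each query to its highest-scoring subject.

    hits: iterable of (query, subject, bitscore) tuples.
    """
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Pairs (a, b) where b is the best hit of a and a is the best hit of b."""
    forward, reverse = best_hits(a_vs_b), best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


# made-up hits between genome A and genome B
a_vs_b = [("a1", "b1", 200.0), ("a1", "b2", 150.0), ("a2", "b2", 90.0), ("a3", "b1", 80.0)]
b_vs_a = [("b1", "a1", 198.0), ("b2", "a1", 140.0), ("b2", "a2", 95.0)]

print(reciprocal_best_hits(a_vs_b, b_vs_a))  # [('a1', 'b1')]
```

Here a2 finds b2 as its best hit, but b2 prefers a1, so the pair is not reported. This is the kind of case that arises when a gene is missing from one gene set or has been duplicated.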

(63)

Genomics - Gene Annotation

Matching to known gene products

• BLAST search finds homologs

• How similar do two sequences need to be to have the same function?

(64)

Genomics - Gene Annotation

(65)

(66)

Genomics - Gene Annotation

Matches to formate dehydrogenase

(67)

Genomics - Gene Annotation

C-terminal binding function

• Transcriptional co-repressor

• Binds to specific C-terminal sequence of E1a

• Defects cause severe homeotic patterning effects in Drosophila

• Initially no detectable NAD+ or DH activity, no known D2 hydroxy-acid substrate

• Acetylates lysophosphatidic acid in Golgi membrane (induces fission)

• RIBEYE, component of ribbon synapse (photoreceptors), is a splice variant

(68)

Genomics - Gene Annotation

C-terminal Binding Protein

• Functional NAD+-dependent dehydrogenase in vitro (pyruvate -> lactate)

• E1A repression is NAD+ dependent

(69)

Genomics - Gene Annotation

Database Pollution – A lurking problem

• Sequences and sequence databases are experimentally determined; they therefore contain errors

• Incorrect gene models

• Pieces of transposons

• Fused or artificially fragmented genes

• Matches to common motifs

• Most databases are archival not curatorial

• Only the original source can correct Genbank data – this seldom happens

• Sequencing projects vary widely in their curatorial standards

• TIGR (JCVI), CSHL very good

• DOE generally poor

(70)

Protein Function

Database pollution

• How does it happen?

• Mixed domain proteins and random motifs contribute

[Figure: proteins with domain compositions BC, B, BC, C, CD]

Original match with B may be spurious

(71)

Protein Function

Tree/cluster methods allow us to make better guesses

• A new sequence that falls inside cluster A or B is unambiguous

• A sequence closer to all of group A than any of group B (or vice versa) is also

unambiguous (only if you have all sequences), but less certain

• A sequence that is in between is ambiguous

(72)

Protein Function

Cluster Concept

• Clusters are formed by a criterion distance

• Minimum linkage (single linkage)

• shortest distance to any member of group

• long stringy clusters

• Average linkage (UPGMA)

• average distance to group

• Maximum linkage (complete linkage)

• longest distance to group

• compact clusters

[Figure: new sequence (?) lying between clusters A and B]

There is no radius that includes all of B without some of A; therefore we are not confident that the new sequence is part of B
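The three linkage criteria can give different answers for the same new sequence. The Python sketch below uses made-up one-dimensional "distances" (positions on a line) in place of real sequence distances, and the function name is hypothetical.

```python
def linkage_distances(point, group, dist):
    """Distance from a point to a group under the three linkage criteria."""
    d = [dist(point, member) for member in group]
    return {
        "single": min(d),             # minimum linkage: nearest member
        "average": sum(d) / len(d),   # average linkage (UPGMA)
        "complete": max(d),           # maximum linkage: farthest member
    }


def line_distance(x, y):
    """Toy distance: absolute difference of positions on a line."""
    return abs(x - y)


group_a = [0.0, 1.0, 2.0]
group_b = [6.0, 8.0, 10.0]
new_sequence = 4.5

to_a = linkage_distances(new_sequence, group_a, line_distance)
to_b = linkage_distances(new_sequence, group_b, line_distance)
print("A:", to_a)   # single 2.5, average 3.5, complete 4.5
print("B:", to_b)   # single 1.5, average 3.5, complete 5.5
```

In this toy case single linkage places the new point with B, complete linkage places it with A, and average linkage is a tie, which is the ambiguous "in between" situation.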

(73)

Protein Function

Clustering

• Clustering and trees are the same

• Phylogenetic trees are a form of hierarchical clustering

• bottom-up (agglomerative)

A

B C D

E F G H I J

=

Maximal Linkage Clustering

(74)

Clustering

• Different clustering/tree construction methods give different results

(75)

Protein Function

Clustering – requirements

• A set of gold standard sequences

• Good gene models

• Known function

• Make a tree or perform clustering with knowns and unknowns

• Look for monophyletic groups or clades

(76)

(77)

Protein Function

Clustering – Example of a perfect result

(78)

Protein Function

Validating cluster and sequence matching based

predictions

(79)

Protein Function

At AKIN10 and AKIN11

• Functional homolog

• Rescue yeast SNF1 deletion

• E-value or score thresholds will vary with the protein family

(80)

Annotation/Functional Analysis

Interpro

Motif/domain database

• Regular Expression

• PSSM

• HMM
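The simplest of the three motif types, the regular expression, can be shown with a PROSITE-style pattern. This Python sketch converts a simple pattern to a Python regular expression and scans a made-up peptide. The converter is hypothetical and handles only basic elements (single residues, x, [..], {..} and repeat counts); the pattern used is the PROSITE N-glycosylation site pattern N-{P}-[ST]-{P}.

```python
import re


def prosite_to_regex(pattern):
    """Convert a simple PROSITE pattern (no < > anchors) to a Python regex."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        # split an element such as x(2,4) into core "x" and repeat bounds 2 and 4
        core, low, high = re.fullmatch(
            r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element).groups()
        if core == "x":
            core = "."                          # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"      # any residue except these
        if low:
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(core)
    return "".join(parts)


regex = prosite_to_regex("N-{P}-[ST]-{P}.")
print(regex)                                   # N[^P][ST][^P]

peptide = "MNGSAANPTK"                         # made-up sequence
sites = [m.start() for m in re.finditer(regex, peptide)]
print(sites)                                   # [1]
```

The site at N-G-S-A matches, while N-P-T-K does not because proline is excluded at the second position. Such short patterns match often by chance, which is one reason PSSM and HMM models are preferred for whole domains.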

(81)

Annotation/Functional Analysis

Interpro

(82)

Annotation/Functional Analysis

Interpro

• Interpro scan result

(83)

Annotation/Functional Analysis

(84)

Annotation/Functional Analysis

Interpro

(85)

Annotation/Functional Analysis

(86)

Annotation/Functional Analysis

EggNOG

• Automatically clustered orthologous genes

(87)

Annotation/Functional Analysis

(88)

Ab initio gene modeling programs

• Goal: predict the gene structure from DNA sequence alone

• Most common

• FGENESH (much like GeneMark)

• GeneMark

• Glimmer and derivatives

• Less Used

• Genscan

• Grail

• NNPP

(89)

Markov Models for Gene Prediction

• GeneMark models (http://exon.biology.gatech.edu/)

• 3 forward, 3 reverse reading frames

• 5th order model (4096 hexamers)

• Non-coding (homogeneous model)

• Initial, internal, terminal exons,

and single exon genes

• Introns

• Intergenic regions

• Initiation site, termination site

• Splice donor, splice acceptor
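A Markov-chain score of "coding versus non-coding" can be sketched in Python. This is not GeneMark: the sketch uses a single homogeneous model per class instead of frame-specific (inhomogeneous) coding models, a low order so that the tiny made-up training strings suffice, and a pseudocount so that unseen words do not give zero probability. All names are hypothetical.

```python
import math
from collections import Counter
from itertools import product


def train_markov(seqs, order, pseudocount=1.0):
    """Estimate P(base | previous `order` bases) from training sequences."""
    counts = Counter()
    for seq in seqs:
        for i in range(len(seq) - order):
            counts[seq[i:i + order + 1]] += 1      # count words of length order+1
    model = {}
    for context in map("".join, product("ACGT", repeat=order)):
        total = sum(counts[context + b] + pseudocount for b in "ACGT")
        model[context] = {b: (counts[context + b] + pseudocount) / total
                          for b in "ACGT"}
    return model


def log_likelihood(seq, model, order):
    """Log probability of seq given the model (first `order` bases not scored)."""
    return sum(math.log(model[seq[i - order:i]][seq[i]])
               for i in range(order, len(seq)))


def coding_log_odds(seq, coding, noncoding, order):
    """Positive values favor the coding model, negative the non-coding model."""
    return log_likelihood(seq, coding, order) - log_likelihood(seq, noncoding, order)


ORDER = 2   # a 5th order model would use ORDER = 5 and far more training data
coding = train_markov(["GCCGCCGCCGCCGCCGCC"], ORDER)         # made-up "coding" data
noncoding = train_markov(["ATATATATATATATATAT"], ORDER)      # made-up "non-coding" data

print(coding_log_odds("GCCGCCGCC", coding, noncoding, ORDER))  # positive
print(coding_log_odds("ATATATAT", coding, noncoding, ORDER))   # negative
```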

(90)

Markov Chains – GeneMark

• E. coli

Genomics – Gene Modeling

[Figure: P(gene|sequence) plotted along the sequence; regions of interest (ORFs)]

(91)

Markov Models

• Markov models are “trained” on known well annotated genes

• Every organism is different!

• How much data might we need?

• $4^6 = 2^{12} = 4096$ 6mer words

• Assume we want a good count, say at least 10 of each word

• Assume that the rarest word is 0.01 times as frequent as the average

• 4096 * 100 * 10 = 4,096,000 bases

• 4 Mb x 9 models (6 exon, 2 intron, 1 intergenic) ≈ 37 Mb

• Need well annotated sequence (i.e., gold standard)

• What if some words still don’t appear

• Interpolated model (Glimmer)

• Self-training model
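The training-data estimate above can be written as a small Python calculation. The function name and parameter names are hypothetical; the default values are the ones used on the slide (6-mers, at least 10 observations, rarest word 100 times less frequent than average).

```python
def training_bases(k=6, min_count=10, rarity_factor=100):
    """Bases needed so that even the rarest k-mer is expected min_count times.

    rarity_factor: how many times less frequent than average the rarest word is.
    """
    n_words = 4 ** k
    return n_words * rarity_factor * min_count


per_model = training_bases()
print(per_model)                 # 4096000

# total requirement scales with the number of separate models to be trained
for n_models in (1, 3, 9):
    print(n_models, n_models * per_model)
```

Because the requirement grows as 4^k, raising the model order by one multiplies the data needed by four, which is why interpolated and self-training models are attractive when annotated sequence is scarce.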

(92)

Markov Chains – GeneMark self-training

• Learning model parameters without knowing anything about the organism!

• Initial Setup

• Donor / acceptor = GT / AG

• Initiation / termination = ATG / TGA, TAG, TAA

• Exon – 5th order inhomogeneous model derived from non-overlapping ORFs > 1000 bases long

• even in eukaryotes very long ORFs are likely to be (single exon) genes

• Predict introns and exons, then update the intron/exon models based on the 6mer frequencies of the predicted regions, weighted by the probability the prediction is