RAST Automated Analysis. What is RAST for?

RAST Automated Analysis

Gordon D. Pusch

Fellowship for Interpretation of Genomes

What is RAST for?

• RAST is designed to rapidly call and annotate the genes of a complete or essentially complete prokaryotic genome.

• RAST uses a "Highest Confidence First" assignment propagation strategy based on manually curated subsystems and subsystem-based protein families that automatically guarantees a high degree of assignment consistency.

• RAST returns an analysis of the genes and subsystems in your genome, as supported by comparative and other forms of evidence.

rast.nmpdr.org

The RAST Strategy

• How does RAST work?

RAST applies FIG's "Subsystem Approach" using a "Highest Reliability First" strategy based on FIG's collection of manually curated Subsystems and subsystem-derived Protein Families (FIGfams).

RAST's subsystem approach automatically ensures a high degree of annotation consistency.

RAST also computes various derived data (sims, BBHs, PCHs, Scenarios, etc.) to support high-throughput genome annotation projects.

RAST Strategy - Calling Genes

• Find RNAs (rRNAs, tRNAs)

• Find gene candidates for "Special Proteins" (selenoproteins and pyrrolysine-containing proteins)

• Find gene candidates for membership in:

• "Universal" FIGfam Protein Families

• FIGfams already seen in the neighboring genomes.

• FIGfams other than those found in the neighboring genomes.

• Repair frameshift errors.

• Promote remaining non-FIGfam gene candidates:

• With similarity to genes in neighbors

• Without similarity to genes in neighbors

• Examine suspiciously long gaps for possible "missing" genes previously found in neighboring genomes (AKA "Backfilling").

Gene candidates found during all previous stages become the "training set" for the current stage.
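The staging rule above (each stage is trained on everything found so far) can be sketched as a simple accumulation loop. This is only an illustration of the control flow, not RAST's implementation: the names `run_stages`, `find_rnas`, and `find_figfam_candidates` are hypothetical, and the toy stages return hard-coded gene labels.

```python
from typing import AbstractSet, Callable, List, Set

# A stage receives the training set accumulated so far and returns new candidates.
Stage = Callable[[AbstractSet[str]], Set[str]]

def run_stages(stages: List[Stage]) -> Set[str]:
    """Run stages in order; each one sees all candidates from earlier stages."""
    accepted: Set[str] = set()
    for stage in stages:
        training_set = frozenset(accepted)  # read-only view of earlier results
        accepted |= stage(training_set)
    return accepted

# Hypothetical toy stages, for illustration only.
def find_rnas(training_set: AbstractSet[str]) -> Set[str]:
    return {"rRNA_1", "tRNA_1"}

def find_figfam_candidates(training_set: AbstractSet[str]) -> Set[str]:
    # Toy rule: this stage can only call genes once some training data exists.
    return {"figfam_gene_1"} if training_set else set()

genes = run_stages([find_rnas, find_figfam_candidates])
```

Because later stages depend on earlier results, the order of the stages matters: running the toy FIGfam stage first would find nothing.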

I/O - What input formats does RAST accept?

• RAST accepts sequence data in FASTA (.fna) or GenBank (.gbk) format, uploaded as plain-text files with no special characters.

• RAST does not yet support other upload formats, such as EMBL, GFF3, GTF, etc. (although it can generate output in these formats).

• RAST will reject any file format that is not plain text, e.g. it will not accept genomes encoded as HTML, PDF, RTF, Microsoft Word, etc.
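A quick local check before uploading can catch the wrong file type early. The sketch below is not RAST's own validation: the function name `guess_upload_format` is hypothetical, and treating "plain text" as pure ASCII is an assumption. It relies only on the facts that FASTA records begin with ">" and GenBank flat files begin with a LOCUS line.

```python
def guess_upload_format(data: bytes) -> str:
    """Return 'fasta', 'genbank', or 'unsupported' for raw file contents."""
    try:
        text = data.decode("ascii")  # assumption: "plain text" means pure ASCII
    except UnicodeDecodeError:
        return "unsupported"
    stripped = text.lstrip()
    if stripped.startswith(">"):
        return "fasta"      # FASTA records start with a ">" header line
    if stripped.startswith("LOCUS"):
        return "genbank"    # GenBank flat files start with a LOCUS line
    return "unsupported"    # HTML, RTF, EMBL, GFF3, and so on
```

A check like this only inspects the start of the file; a file that passes can still be malformed further down.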

I/O - Genes reannotated or recalled?

• If you want to keep the original gene coordinates, then you must upload a GenBank file and select the "Keep existing gene calls" option. RAST will then assign functions and perform a subsystem analysis, without recalling the genes of your genome.

• RAST cannot preserve existing gene calls if FASTA contig data are uploaded, because the FASTA format cannot specify gene locations.

I/O - Viewing Results

• You can browse your results and graphically compare them to other genomes using the SEED Viewer.

• You can also download the analysis of your genome in various formats:

• GenBank

• EMBL

• GFF3

• GTF

• SEED genome directory (as tarfile)

Input Data Quality

• What is the poorest quality of data that RAST can handle?

We recommend mean contig length >2 kbp, with <1% ambiguity characters. If your assembly quality is worse than this, RAST will most likely fail.

It is possible that the metagenomic version of RAST may be able to do something with extremely low quality assemblies; however, MG-RAST is not really designed for this job.
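The two recommended thresholds (mean contig length above 2 kbp, fewer than 1% ambiguity characters) can be checked locally before submission. This is a minimal sketch, not part of RAST: the function names are hypothetical, and counting every character other than A, C, G, T, or U as "ambiguous" is an assumption about how the 1% figure is measured.

```python
def parse_fasta(text: str) -> dict:
    """Return {contig_id: sequence} from FASTA text (IDs end at first whitespace)."""
    contigs, current = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            current = fields[0] if fields else ""
            contigs[current] = []
        elif current is not None:
            contigs[current].append(line.upper())
    return {cid: "".join(parts) for cid, parts in contigs.items()}

def assembly_quality(text: str, min_mean_len: int = 2000,
                     max_ambiguous: float = 0.01) -> dict:
    """Compute mean contig length and ambiguity fraction against the thresholds."""
    seqs = list(parse_fasta(text).values())
    total = sum(len(s) for s in seqs)
    if total == 0:
        return {"mean_contig_len": 0.0, "ambiguous_fraction": 0.0, "ok": False}
    mean_len = total / len(seqs)
    # Assumption: anything other than A, C, G, T, U counts as ambiguous.
    ambiguous = sum(1 for s in seqs for ch in s if ch not in "ACGTU")
    fraction = ambiguous / total
    return {
        "mean_contig_len": mean_len,
        "ambiguous_fraction": fraction,
        "ok": mean_len > min_mean_len and fraction < max_ambiguous,
    }
```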

Input Data Quality

• RAST is designed for and performs best on complete or essentially complete genomes.

• Conversely, RAST's performance degrades substantially when presented with only a small fragment of a genome.

• Even if you are only interested in a few genes in a small region, it is recommended that you upload as much of your genome as possible, and at minimum 100 kbp of contig data. The probability that RAST will abort with errors increases rapidly below the 100 kbp threshold, and is well in excess of 50% below 40 kbp.
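The 100 kbp and 40 kbp figures above translate into a simple size check on the total amount of contig data. The sketch below is illustrative only: the function name and the three tier labels are hypothetical, and only the two thresholds come from the slide.

```python
def submission_size_risk(contig_lengths: list) -> str:
    """Classify a submission by its total contig length in base pairs."""
    total = sum(contig_lengths)
    if total < 40_000:
        return "high"         # failure probability well in excess of 50%
    if total < 100_000:
        return "elevated"     # below the recommended 100 kbp minimum
    return "recommended"      # at or above 100 kbp of contig data
```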

Input Data Quality

• What is meant by an "essentially complete" genome?

We consider a genome to be "essentially complete" at about 99% coverage, since beyond that point the expected number of genes missed because of sequencing gaps is smaller than the expected number of "false negatives" from the genefinder.

From a Subsystem Analysis standpoint, completeness beyond 99% is the point of diminishing returns.

In terms of sequence redundancy: At least 5x coverage for Sanger Sequencing, or at least 10x coverage using 454.

In terms of contig length: At least 70% of the assembled sequence data are in contigs longer than 20 kbp.
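The contig-length criterion can be computed directly from a list of contig lengths. This is a minimal sketch with hypothetical function names; it reads "longer than 20 kbp" as strictly greater than 20,000 bp, which is an assumption about the intended boundary.

```python
def fraction_in_long_contigs(contig_lengths: list, min_len: int = 20_000) -> float:
    """Fraction of assembled bases that sit in contigs longer than min_len."""
    total = sum(contig_lengths)
    if total == 0:
        return 0.0
    return sum(n for n in contig_lengths if n > min_len) / total

def meets_contig_criterion(contig_lengths: list, min_fraction: float = 0.70) -> bool:
    """True if at least min_fraction of the sequence is in long contigs."""
    return fraction_in_long_contigs(contig_lengths) >= min_fraction
```

For example, an assembly with contigs of 50, 30, 10, and 10 kbp has 80 of its 100 kbp in contigs longer than 20 kbp, so it satisfies the 70% criterion.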

Input Sequence Types

• Will RAST handle just a plasmid?

RAST is not designed to handle only plasmids or small fragments. We recommend that you upload the entire genome, even if you intend to view only your plasmid. (An extension of RAST to plasmids has been proposed.)

• What about Eukaryotes?

No — not even small ones, and not even organelles! Currently, RAST requires you to specify whether your genome is a bacterium or archaeon. If you try to submit a eukaryote, RAST will most likely abort with errors.

(An extension of RAST to eukaryotes with already-called genes has been proposed.)

Input Sequence Types

• What about ESTs?

RAST is not designed to analyze ESTs, and will most likely abort with errors.

You can try submitting EST data to the metagenomic version of RAST — but again, it is not really designed for them.

• What about Metagenomes?

As previously mentioned, there is a special metagenomic version of RAST designed specifically to analyze the sort of massive, low-quality datasets typically generated by

FAQs and Common Problems

• Who do I contact if I have questions about or problems using RAST?

• All questions or problems regarding RAST should be sent to [email protected]

• All questions or problems regarding MG-RAST should be sent to [email protected]

FAQs and Common Problems

• Will RAST assemble my reads into contigs?

No. You will need to assemble your reads into contigs yourself, using some other tool.

• Why does RAST complain that it can't find the "phylogenetic neighborhood" of my submission?

Usually, this is because the submitted sequence data are too small.

Experience suggests that RAST needs at least 40 kbp of sequence data to reliably place a submission's phylogenetic neighborhood. (100 kbp is better.)

FAQs and Common Problems

• RAST is complaining about "Duplicate contig IDs," but all my contig IDs appear unique to me. What's going on?

Your contig IDs may contain "whitespace" characters. The FASTA standard allows no "whitespace" between the ">" symbol and the contig ID, and treats everything after the first "whitespace" character as a "comment" that is not part of the identifier.

Thus, the first FASTA header below is invalid (no ID, just a comment), while the following two will be interpreted as a pair of "duplicate IDs" that are both named "B.":

> E. coli main chromosome

>B. subtilis main chromosome

>B. subtilis plasmid
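The header rule described above can be reproduced in a few lines to find empty or duplicate IDs before uploading. This is a sketch of the rule as stated on the slide, not RAST's own parser; the function names `contig_id` and `check_headers` are hypothetical.

```python
from collections import Counter

def contig_id(header: str) -> str:
    """Return the ID of a FASTA header: the text from '>' to the first whitespace."""
    body = header[1:]
    if not body or body[0].isspace():
        return ""  # whitespace right after '>' means there is no ID at all
    return body.split()[0]

def check_headers(fasta_text: str):
    """Return (number of headers with no ID, sorted list of duplicated IDs)."""
    ids = [contig_id(line) for line in fasta_text.splitlines()
           if line.startswith(">")]
    empty = sum(1 for i in ids if not i)
    counts = Counter(i for i in ids if i)
    duplicates = sorted(i for i, n in counts.items() if n > 1)
    return empty, duplicates

example = (
    "> E. coli main chromosome\nACGT\n"
    ">B. subtilis main chromosome\nACGT\n"
    ">B. subtilis plasmid\nACGT\n"
)
empty_count, duplicate_ids = check_headers(example)
```

A common fix is to replace the spaces inside the intended ID with underscores, so that the three headers above would become `>E._coli_main_chromosome` and so on.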

FAQs and Common Problems

• Why does RAST complain about "invalid characters" in my FASTA input file?

Most likely for one of two reasons:

• Your contig sequences contain characters other than the standard IUPAC nucleotide and ambiguity characters [ACGTUMRWSYKBDHVN] or the "vector masking" character "X" (e.g., because you uploaded protein sequences rather than DNA sequences).

• Your contig file uses nonstandard line terminators, is missing line terminators before or after a record header, or is otherwise malformed in some way.
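The first cause can be checked locally by scanning the sequence lines against the allowed alphabet. This is a minimal sketch with a hypothetical function name; treating lowercase letters as equivalent to uppercase is an assumption, since the slide lists only uppercase characters.

```python
# Allowed characters from the slide: IUPAC nucleotide codes plus the masking "X".
ALLOWED = frozenset("ACGTUMRWSYKBDHVNX")

def invalid_characters(fasta_text: str) -> list:
    """Return the sorted set of disallowed characters found in sequence lines."""
    bad = set()
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            continue  # header lines are not sequence data
        # Assumption: case is ignored, so "acgt" is treated like "ACGT".
        bad.update(ch for ch in line.strip().upper() if ch not in ALLOWED)
    return sorted(bad)
```

Uploading protein sequences is easy to spot this way, because residues such as E, L, and Q are not nucleotide codes. Note that `splitlines()` accepts several line-terminator styles, so this sketch does not detect the second cause.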

FAQs and Common Problems

• How do I get a more detailed explanation of why my job failed?



If the RAST webpage describing the error is