SOLiD™ Software Quick Facts
1. What’s New in SOLiD™ Software Data Processing and Data Analysis
2. SOLiD™ Data Processing Overview
3. SOLiD™ 4 Result Size and On-Instrument Cluster
4. SOLiD™ Offline Analysis Cluster Specification
5. SOLiD™ BioScope™ Software
6. BioScope™ v1.2 Software vs. Corona_Lite for Offline Analysis
1. What’s New in SOLiD™ Software Data Processing and Data Analysis
SOLiD™ Accuracy Enhancer Tool
The SOLiD™ Accuracy Enhancer Tool (SAET) is a spectral alignment error correction tool. Applied to raw data generated by the SOLiD™ platform, it reduces the color calling error rate by a factor of three to five without requiring a reference genome.
Use SAET to pre-process raw reads before alignment. The lower error rate improves mapping, SNP calling, and de novo assembly results: mapping becomes more accurate and the number of mapped reads increases by 40-50%. SAET is not recommended for whole genome resequencing of large genomes (larger than 600 Mbases).
SAET is available through the BioScope™ Software command line only.
History tab
Use the history tab in the Web browser to download or view files generated during a selected plug-in session. The history feature is available only from the BioScope™ Software web browser.
Barcode script
The barcode script runs a given BioScope™ Software data analysis on a set of barcode library read files in batch mode. Use the script to run simultaneous secondary or tertiary tests on barcoded libraries.
BAM format output
BioScope™ Software secondary analysis (mapping and pairing) now produces a BAM file as the main alignment format. Mate pair and paired end analysis produces a BAM file, while a single file conversion is needed for fragment libraries. Depending on the output filter selected, unmapped and secondary alignments can be included.
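As a rough illustration of what such an output filter decides, the sketch below uses the FLAG bits that the SAM/BAM format defines for unmapped reads (0x4) and secondary alignments (0x100). This is not BioScope™ code; the function name `keep_record` and its `keep_unmapped` / `keep_secondary` parameters are hypothetical and only mirror the two choices named above.

```python
# Illustrative sketch only; BioScope's own output filter is set in its configuration.
UNMAPPED = 0x4      # SAM FLAG bit: read is unmapped
SECONDARY = 0x100   # SAM FLAG bit: secondary alignment

def keep_record(flag, keep_unmapped=False, keep_secondary=False):
    """Return True if a record with this SAM FLAG value passes the filter."""
    if flag & UNMAPPED and not keep_unmapped:
        return False
    if flag & SECONDARY and not keep_secondary:
        return False
    return True

# Hypothetical FLAG values: 16 = reverse strand, 272 = secondary + reverse strand.
flags = [0, 16, 4, 256, 272]
print([f for f in flags if keep_record(f)])  # [0, 16]
```

With both options left at False, only primary mapped alignments remain; switching either option on admits the corresponding records.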
HD-300 performance
HD-300 increases throughput. BioScope™ Software v1.2 can process larger files at increased throughput, which results in more reads while maintaining current speed and increasing density.
SNP detection post-error file changes
BioScope™ Software now provides the option of using pre-generated Probe Error files. The main advantage is the time saved by not regenerating the Probe Error files every time SNP detection is executed.
Using SOLiDBioScope.com™
Users of cloud computing can work with BioScope™ Software using SOLiDBioScope.com™.
This feature provides the following benefits:
• No up front (capital) cost and no cost commitments
• Pay as you go; predictable operational costs
• Scale up/down as demand needs
• Upload data via Internet or physical shipments of hard drives
For more information, please visit here.
Fusion/Splicing
A fusion junction is a section of transcribed RNA that maps to an exon from one gene followed by an exon from another gene. It can occur as the result of a translocation, deletion, or chromosomal inversion. It excludes exon-exon boundaries that arise from alternative splicing for a gene.
ChIP-Seq
BioScope™ Software v1.2 gives you the option to perform ChIP-Seq resequencing through the BioScope™ Software browser.
Paired-end mapping
BioScope™ Software v1.2 adds support for paired end experiments in addition to mate pair. In a normal paired end experiment, the F5 and F3 tags map to the genome on different strands, face toward each other, and satisfy a distance constraint determined by the insert size.
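The three conditions for a normal paired end can be sketched as a small check. This is a minimal illustration, not the BioScope™ pairing algorithm: the function name, the coordinate convention (leftmost 0-based start of each tag on the reference), and the example tag lengths and insert range are all assumptions made for the example.

```python
def is_normal_paired_end(f3_start, f3_strand, f5_start, f5_strand,
                         f3_len, f5_len, min_insert, max_insert):
    """Check opposite strands, inward-facing orientation, and insert size."""
    # Condition 1: the two tags must map to different strands.
    if f3_strand == f5_strand:
        return False
    # Work out which tag is on the forward strand and which on the reverse.
    if f3_strand == "+":
        fwd_start, rev_start, rev_len = f3_start, f5_start, f5_len
    else:
        fwd_start, rev_start, rev_len = f5_start, f3_start, f3_len
    # Condition 2: facing each other means the forward tag lies upstream.
    if fwd_start > rev_start:
        return False
    # Condition 3: outer distance must fall in the expected insert range.
    span = (rev_start + rev_len) - fwd_start
    return min_insert <= span <= max_insert

# Hypothetical tags: 50 nt F3 on "+", 35 nt F5 on "-", insert range 100-300 bp.
print(is_normal_paired_end(1000, "+", 1150, "-", 50, 35, 100, 300))  # True
```

Pairs that fail any one of the three conditions would be candidates for other pairing categories, which this sketch does not attempt to classify.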
Export Configuration
You can now use the web browser as well as the command line to create the configuration file that is required to perform experiments on barcoded libraries.
2. SOLiD™ System Data Processing Overview
SOLiD™ Instrument Control Software (ICS)
Provides automated instrument operation and submits data processing jobs for primary analysis:
• Imaging
• Bead finding
• Image registration
• Filtering
• Color call
SOLiD™ Experiment Tracking Systems (SETS)
Web-based application that enables users to view on-instrument real-time data and completed run analysis reports from the SOLiD™ Analysis Tools. Secondary analysis (mapping) is enabled on-instrument, although it is recommended to run secondary analysis off-instrument using BioScope™ Software.
SOLiD™ BioScope™ Software
SOLiD™ BioScope™ Software provides a command line and a simple web interface that builds configuration files for running application-specific sequence analysis tools. The BioScope™ Software framework enables the user to perform off-instrument secondary and tertiary analyses, and it allows configurable bioinformatics workflows for resequencing (mapping, SNP finding (diBayes), copy number variations, inversions, small indels, large indels) and whole transcriptome analysis (mapping, splicing/fusion detection, counting, UCSC WIG file creation). Results can be exported in BAM format. The resulting industry-standard files from BioScope™ Software can be used with third-party visualization and analysis software tools.
3. SOLiD™ 4 System Result Size and On-Instrument Cluster
The SOLiD™ Software analysis pipeline generates results in BAM format containing base space sequence. Mapping results in BAM format contain base space sequences, color space sequences and quality values. The data sizes from the SOLiD™ 4 System, as indicated in the table below, are for fully loaded experiments and directly correlate with the throughput (total bases generated from a run).
Table 1: Result size generated on SOLiD™ 4 System sequencing (assumes 2 slides and deposition densities of 300K beads/panel, 2357 panels)
50 nt tag, 300K/panel, 2357 panels | Image data size | Primary analysis results size in “.spch” format | Primary analysis data size (flat files: “.csfasta”, “_QV.qual”, “.stats”)
1 slide – 1 tag | 1.84 TB | 646 GB | 170 GB
1 slide – 2 tags | 3.6 TB | 1.29 TB | 340 GB
2 slides – 1 tag | 3.6 TB | 1.29 TB | 340 GB
2 slides – 2 tags | 7.2 TB | 2.58 TB | 680 GB
Note: Image data is not needed after analysis is complete, and intensity files are no longer required for submission into NCBI:
http://www.ncbi.nlm.nih.gov/Traces/sra/static/Sequence_Read_Archive_Overview.pdf
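Because result sizes track throughput, it can help to see how the total bases for a run follow from the Table 1 assumptions. The sketch below is simple arithmetic, not an Applied Biosystems formula: it assumes the 2357 panels are per slide and that every deposited bead yields one full-length tag, so it gives an upper bound. The function name is hypothetical.

```python
def bases_per_run(beads_per_panel, panels, read_length, tags=1, slides=1):
    """Upper-bound bases generated, assuming every bead yields a usable tag."""
    return beads_per_panel * panels * read_length * tags * slides

# Table 1 assumptions: 300K beads/panel, 2357 panels, 50 nt tags.
print(bases_per_run(300_000, 2357, 50))                    # 35355000000
print(bases_per_run(300_000, 2357, 50, tags=2, slides=2))  # 141420000000
```

Under these assumptions, one slide with one tag gives roughly 35 Gbases, and a fully loaded run (2 slides, 2 tags) gives four times that, which matches the fourfold growth in the table's data sizes.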
On-Instrument Compute Cluster Specification
SOLiD™ 4 System Specification
Computer Components
• 19-inch flat screen monitor, mouse and keyboard
• Instrument controller
• Head node
• Compute nodes (3)
• Shared storage
• Gigabit Switch
• Power distribution units (2)
• Power cords (2), attached
• Instrument Control Software v4.0
• SOLiD™ Experimental Tracking System (SETS) v4.0 Software Suite
Instrument Controller
• Hardware: Intel® Xeon® processors
• Operating system: Microsoft® Windows® XP Professional, Service Pack 2
• Installed RAM: 8 GB
• Hard disk storage: dual 250 GB SATA hard drives (RAID-1)
• Peripheral: CD-RW/DVD ROM, 19-inch flat screen monitor, keyboard, mouse
Head Node
• Hardware: Intel® Xeon® Quad Core processors (2)
• Operating system: 64-bit LINUX
• Installed RAM: 24 GB
• Hard disk storage: 6 x 1 TB SATA hard drives (RAID-5)
Compute Nodes (each)
• Hardware: Intel® Xeon® Quad Core processors (2)
• Operating system: 64-bit LINUX
• Installed RAM: 24 GB
• Hard disk storage: 2 x 1 TB SATA hard drives (RAID-0)
Shared Storage
• Hard disk storage: 15 x 1 TB SATA hard drives
• RAID-5 w/ hot spare
Gigabit Switch
• 16 port Gigabit Switch
Power Distribution Units
• Rack PDU (2) with 16 output connector
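To relate the disk configurations above to the result sizes in Table 1, the nominal usable capacity of each array can be estimated from its RAID level. This is a back-of-the-envelope sketch using the standard RAID capacity rules; it ignores filesystem overhead, and the assumption of exactly one hot spare in the shared storage is ours, since the specification does not give a count. The function name is hypothetical.

```python
def usable_tb(disks, disk_tb, level, hot_spares=0):
    """Nominal usable capacity in TB for RAID-0, RAID-1, or RAID-5."""
    active = disks - hot_spares
    if level == 0:
        return active * disk_tb          # striping: all disks hold data
    if level == 1:
        return disk_tb                   # mirroring: one disk's worth
    if level == 5:
        if active < 3:
            raise ValueError("RAID-5 needs at least 3 active disks")
        return (active - 1) * disk_tb    # one disk's worth of parity
    raise ValueError(f"unsupported RAID level: {level}")

print(usable_tb(2, 0.25, 1))               # instrument controller: 0.25
print(usable_tb(6, 1, 5))                  # head node: 5
print(usable_tb(2, 1, 0))                  # each compute node: 2
print(usable_tb(15, 1, 5, hot_spares=1))   # shared storage (1 spare assumed): 13
```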
Although full mapping is enabled on the instrument cluster for small genomes (e.g., bacteria), we do not recommend running secondary analysis (mapping) on the instrument cluster. We highly recommend running secondary analysis (mapping) on a separate offline analysis cluster instead. For large genomes (e.g., human), a sub-mapping can be performed on-instrument, but only for quality assessment of the run.
4. Offline Analysis Cluster Specification
The SOLiD™ 4 System allows cycle by cycle auto-export and manual export of primary analysis results to an offline cluster where the secondary analysis can be performed independently of the instrument. This feature enables the instrument to be utilized for additional experiments while secondary analyses are being performed.
Offline Analysis
The offline analysis cluster specification is shown below and represents the minimal and the recommended (Penguin Computing) specification for offline data analysis.
BioScope™ Software Offline Cluster Specification
Minimal Offline Cluster Specification
Head Node | > 2 GHz processors; 16 GB RAM; 100 GB local disk space for OS + software installation
Compute Nodes (minimum 3 compute nodes) | > 2 GHz processors, 16+ GB RAM, 8+ cores per node; > 500 GB local disk space for OS + software installation
Gigabit Switch | 1 GB Switch
Operating System | CentOS 4.x, 5.x; RedHat 4.x, 5.x
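The minimal specification can be read as a simple checklist. The sketch below encodes the thresholds listed above so that a planned cluster can be compared against them; the `Node` class, its field names, and the example hardware values are hypothetical and are not part of BioScope™ Software.

```python
from dataclasses import dataclass

@dataclass
class Node:
    ghz: float     # processor clock speed
    ram_gb: int    # installed RAM
    cores: int     # cores per node
    disk_gb: int   # local disk for OS + software installation

def check_minimal_spec(head, compute_nodes):
    """Return a list of problems; an empty list means the minimum is met."""
    problems = []
    if head.ghz <= 2.0 or head.ram_gb < 16 or head.disk_gb < 100:
        problems.append("head node below minimum")
    if len(compute_nodes) < 3:
        problems.append("fewer than 3 compute nodes")
    for i, node in enumerate(compute_nodes):
        if (node.ghz <= 2.0 or node.ram_gb < 16
                or node.cores < 8 or node.disk_gb <= 500):
            problems.append(f"compute node {i} below minimum")
    return problems

# Hypothetical cluster: one head node and three identical compute nodes.
head = Node(ghz=2.4, ram_gb=16, cores=8, disk_gb=100)
workers = [Node(ghz=2.4, ram_gb=24, cores=8, disk_gb=1000) for _ in range(3)]
print(check_minimal_spec(head, workers))  # []
```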
5. SOLiD™ BioScope™ Software
OVERVIEW
SOLiD™ BioScope™ Software is a framework for bioinformatics tools to perform off-instrument secondary and tertiary analysis. BioScope™ Software consists of a collection of bioinformatics tools that are integrated into a single command line shell. Additionally, BioScope™ Software provides a simple web interface to help build instructions (configuration files) to run these tools. BioScope™ Software includes mapping, pairing, SNP finding, structural variations and whole transcriptome analyses.
FEATURES
BioScope™ Software 1.2 uses a flexible pipeline architecture that provides maximum flexibility and ease of use for high throughput analysis of SOLiD™ instrument data. It contains the following features:
Mapping
BioScope™ Software 1.2 features Applied Biosystems’ newly developed mapping algorithm MaxMapper, which drastically improves the mapping rate and mapping speed over the previous generations of mapping techniques. It produces mapped data that gives excellent sensitivity and specificity in SNP finding using Applied Biosystems’ diBayes SNP finding algorithm, a component in BioScope™ Software.
Pairing
BioScope™ Software 1.2 features Applied Biosystems’ updated pairing algorithm, which handles mapped data from MaxMapper. It provides new pairing categories D and E and provides a file of reads that do not pair.
Resequencing Applications
BioScope™ Software 1.2 features five resequencing applications: SNP Finding, Copy Number Variation for Human, Small indel finding, Large indel finding and Inversion.
SNP Finding
BioScope™ Software features Applied Biosystems’ latest SNP finding technology for SOLiD™ System instrument data, which allows sensitive and specific SNP detection even at moderate to low coverage. It allows varying levels of stringency and provides full control over many filters for customization to fit particular experimental designs.
CNV (Human)
BioScope™ Software features Applied Biosystems’ latest progress on human copy number variation detection technology, which allows detecting variations as small as 5KB and as large as the whole chromosome in humans from single sample sequences. It can detect such variations at very low coverage, even at 1x.
Small Indel
BioScope™ Software features Applied Biosystems’ latest progress on detecting small indel variation. By using a novel split read technique alongside a powerful indel caller on pileups, it is able to detect deletions up to 500bp and insertions up to 20bp.
Large Indel
BioScope™ Software features Applied Biosystems’ latest progress on identifying large insertions and deletions compared to a reference. It can identify large insertions and deletions (indels) from 100bp to 100Kb with great confidence. It is able to accept multiple mate pair libraries to increase coverage.
Inversion
BioScope™ Software features Applied Biosystems’ latest progress on detecting genomic regions that are inverted with respect to a reference. Taking advantage of the longer insert size and lower inverted dimer noise of SOLiD™ mate pair technology, it produces a confident list of inversions.
Whole Transcriptome Analysis
The Whole Transcriptome Analysis (WTA) in BioScope™ Software aligns reads to a reference genome. Using the mapping results, it counts the number of tags aligned with exons, and can convert the BAM file to WIG for display of coverage on the UCSC genome browser. WTA also supports experiments in fusion detection.
Create UCSC WIG file
This plug-in takes the BAM file and converts it into WIG files containing coverage data. Coverage is the number of reads covering a given stranded genome position.
Count known exons
Given a BAM file of mapped reads and predefined regions (such as exons) provided by the user in a .gtf format file, this plug-in generates tag counts for annotated regions.
Fusion, splicing
A fusion junction is a section of transcribed RNA that maps to an exon from one gene followed by an exon from another gene. It can occur as the result of a translocation, deletion, or chromosomal inversion. A fusion junction excludes exon-exon boundaries that arise from alternative splicing for a gene.
There are five models of alternative splicing:
• Exon skipping or cassette exon: An exon may be spliced out of the primary transcript or retained. This is the most common mode in mammalian pre-mRNAs.
• Mutually exclusive exons: One of two exons is retained in mRNAs after splicing, but not both.
• Alternative donor site: An alternative 5’ splice junction (donor site) is used, changing the 3’ boundary of the upstream exon.
• Alternative acceptor site: An alternative 3’ splice junction (acceptor site) is used, changing the 5’ boundary of the downstream exon.
• Intron retention: A sequence may be spliced out as an intron or simply retained. This is distinguished from exon skipping because the retained sequence is not flanked by introns. If the retained intron is in the coding region, the intron must encode amino acids in frame with the neighboring exons.
6. BioScope™ 1.2 Software vs. Corona_Lite for Offline Analysis
Offline Data Analysis | BioScope™ 1.2 | Corona_Lite v4.2
Analysis execution | Integrated command line and simple web interface | Integrated command. Can run batch mode.
Programming language | Java | Scripting languages
User interface for parameter setting and analysis | GUI (browser interface of BioScope); expanded rich parameter setting through command-line interface | Command line interface
Multiple run combination analysis | Yes | Yes
Mapping algorithm | Max Mapper for SOLiD™ 4 System | MapReads
Default mapping setting | Max Mapper: Anchor and Extend | Full length with fixed number of mismatches
Iterative mapping | User configurable | No/Manual
Multi-threading | Yes | No
SNP algorithm | DiBayes | SNP caller
Integrated small indel analysis | Yes | Yes
Integrated large indel analysis | Yes | No
Integrated Human CNV analysis | Yes | No
Integrated Inversion analysis | Yes | No
Integrated Whole Transcriptome analysis | Yes | No
Output results format | SAM/BAM output | Fasta-like matching output (including unique match and all matches), optional paired mates, GFF v0.2, SNP list text file
Stats file | New format: adds extension information and quality value | Old stats file
Speed | Optimized compute performance for complex genome analysis | Supports complex genome analysis
Warranty | Yes | No
AB support to end users | Yes | Yes
Supported OS | Linux CentOS v4.x, PBS (Torque), PBS pro and SGE | Linux, PBS, LSF, SGE
7. Data Visualization
The BioScope™ Software pipeline generates read results in BAM format, including base space sequence. These results can be visualized in browsers such as the UCSC genome browser and the Broad Institute’s Integrative Genomics Viewer (IGV).
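For display in the UCSC genome browser, coverage is typically supplied as a WIG track, as produced by the Create UCSC WIG file plug-in described in section 5. The sketch below shows the underlying idea only, counting how many reads cover each position and writing the result in WIG variableStep form. It is not the plug-in's implementation: it takes read intervals directly instead of parsing a BAM file, ignores strand, and the function name and example reads are hypothetical.

```python
from collections import defaultdict

def coverage_wig(chrom, reads):
    """Build WIG variableStep lines from 0-based, half-open (start, end) reads."""
    # Record +1 where a read starts and -1 where it ends.
    delta = defaultdict(int)
    for start, end in reads:
        delta[start] += 1
        delta[end] -= 1
    lines = [f"variableStep chrom={chrom}"]
    depth = 0
    points = sorted(delta)
    # Walk between consecutive breakpoints; depth is constant in each stretch.
    for pos, nxt in zip(points, points[1:]):
        depth += delta[pos]
        if depth > 0:
            for p in range(pos, nxt):
                lines.append(f"{p + 1} {depth}")  # WIG positions are 1-based
    return lines

# Two hypothetical overlapping reads on chr1.
for line in coverage_wig("chr1", [(0, 3), (2, 5)]):
    print(line)
```

Positions with zero coverage are simply omitted, which variableStep allows; a strand-aware version would build one such track per strand.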
© 2010 Life Technologies Corporation. All rights reserved. The trademarks mentioned herein are the property of Life Technologies Corporation or their respective owners.
For research use only. Not intended for any animal or human therapeutic or diagnostic use.